ZK MapReduce Primer

What is MapReduce?

In a traditional sequential computation, a single processor executes an entire program along a linear path, where each instruction is executed in the order it appears in the code. With MapReduce and similar distributed computing paradigms, such as Spark's RDDs (Resilient Distributed Datasets), the computation is divided into many small tasks that can be executed on multiple processors in parallel.

MapReduce works by breaking down a large dataset into smaller chunks and distributing them across a cluster of machines. Broadly, this computation follows three distinct steps:

  1. Each machine performs a map operation on its chunk of data, transforming it into a set of key-value pairs.

  2. The key-value pairs generated by the map operation are then shuffled and sorted by key.

  3. The results of the shuffle are passed to a reduce operation, which aggregates the data and produces the final output.
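
To make these three phases concrete, here is a minimal, self-contained Python sketch of the pattern; the toy dataset and function names are illustrative only and don't correspond to any particular MapReduce framework.

# Map, shuffle, reduce on a toy dataset (illustrative Python sketch)
from collections import defaultdict

# Toy dataset split into chunks, one per (simulated) worker.
chunks = [[3, 5, 2], [7, 1], [4, 4, 4]]

# 1. Map: each worker emits (key, value) pairs for its chunk.
def map_chunk(chunk):
    return [("sum", x) for x in chunk]

mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]

# 2. Shuffle: group the emitted values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# 3. Reduce: aggregate each key's values into a final result.
def reduce_key(key, values):
    return key, sum(values)

result = dict(reduce_key(k, v) for k, v in grouped.items())
print(result)  # {'sum': 30}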

The main advantage of MapReduce is its scalability. Since the computation is distributed across multiple machines, it can handle large-scale datasets that would be infeasible to process on a single machine. Additionally, the distributed nature of MapReduce allows computation to scale horizontally across more machines, rather than vertically in terms of computation time.

ZK For Big Data Computation:

While we've seen a proliferation of popular ZKVMs over the past year from =nil; Foundation, RISC Zero and Polygon Miden, each of them has focused on a general-purpose sequential execution model. In contrast, Lagrange's ZK MapReduce (ZKMR) is an extension of MapReduce that leverages recursive proofs to prove the correctness of distributed computation over large amounts of on-chain state data.

This is achieved by generating proofs of computational correctness for each given worker during either the map or reduce steps of a distributed computation job. These proofs can be composed recursively to construct a single proof for the validity of the entire distributed computational workflow. In other words, the proofs of the smaller computations can be combined to create a proof of the entire computation. This allows ZKMR to scale more efficiently for complex computations on large data sets that require multiple steps or subcomputations.
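
As a rough illustration of how such composition could be structured, the sketch below folds per-worker proofs together pairwise in a binary tree. The prove_chunk and aggregate functions are hypothetical stand-ins for a recursive proof system, not Lagrange's actual API.

# Recursive proof composition (illustrative Python sketch)

def prove_chunk(chunk_id):
    # placeholder: a worker proves its map (or reduce) step
    return f"proof({chunk_id})"

def aggregate(left, right):
    # placeholder: one recursive step folds two proofs into one
    return f"agg({left}, {right})"

def compose(proofs):
    # pairwise-aggregate until a single proof for the whole job remains
    while len(proofs) > 1:
        paired = zip(proofs[0::2], proofs[1::2])
        next_layer = [aggregate(l, r) for l, r in paired]
        if len(proofs) % 2 == 1:  # carry an unpaired proof up to the next layer
            next_layer.append(proofs[-1])
        proofs = next_layer
    return proofs[0]

worker_proofs = [prove_chunk(i) for i in range(4)]
print(compose(worker_proofs))  # agg(agg(proof(0), proof(1)), agg(proof(2), proof(3)))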

Why Use MapReduce?

To understand how recursive proofs can be composed, let's consider a simple example. Imagine that we want to calculate the average liquidity of an asset pair on a DEX over a given period of time. For each block, we must verify an inclusion proof against that block's state root, showing how much liquidity was in the DEX at that block. We must then sum the liquidity across every block in the window and divide by the number of blocks.

Sequentially this computation would look as follows:

# Sequential Execution
func calculate_avg_liquidity(var mpt_paths, var liquidity_per_block){
    
    var sum = 0;
    
    for(int i=0; i<len(liquidity_per_block); i++){
        
         // verify the MPT inclusion proof for this block's liquidity value
         verify_inclusion(liquidity_per_block[i], mpt_paths[i])
         sum += liquidity_per_block[i]
        
    }
    
    // divide the total liquidity by the number of blocks
    var avg_liquidity = sum / len(liquidity_per_block)

    // return the average liquidity
    return avg_liquidity
}

While this computation may intuitively appear straightforward, its performance degrades quickly as the amount of state that must be included increases.

In contrast, with ZKMR we can divide the dataset into smaller chunks and distribute them across multiple processors. Each machine performs a map operation that verifies the Merkle-Patricia Trie (MPT) proofs and sums the liquidity for its chunk of data. The map operation then generates a key-value pair where the key is a constant string (e.g., average_liquidity) and the value is a tuple containing the sum and count.

The output of the map operation will be a set of key-value pairs that can be shuffled and sorted by key, and then passed to a reduce operation. The reduce operation will receive a key-value pair where the key is average_liquidity and the value is a list of tuples containing the sum and count of the liquidity from each map operation. The reduce operation will then aggregate the data by summing the individual sums and counts, and then dividing the final sum by the final count to get the overall average liquidity.

# Map step
func map(liquidity_data_chunk, mpt_path_chunk){
    
    var sum = 0
    
    for(int i=0; i<len(liquidity_data_chunk); i++){
        
         // verify the MPT inclusion proof for this block's liquidity value
         verify_inclusion(liquidity_data_chunk[i], mpt_path_chunk[i])
         sum += liquidity_data_chunk[i]
    }
    
    // emit the partial sum and the number of blocks in this chunk
    return ("average_liquidity", (sum, len(liquidity_data_chunk)))
}

# Reduce step
func reduce(mapped_values){
    
    var sum = 0
    var count = 0
    
    for(int i=0; i<len(mapped_values); i++){
        
        // each entry is a (sum, count) tuple emitted by a map worker
        sum += mapped_values[i][0]
        count += mapped_values[i][1]
    }
    
    return ("average_liquidity", sum/count)
}
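
To show how the two steps fit together, here is a runnable Python sketch of a driver that mirrors the pseudocode above; verify_inclusion is stubbed out and the sample liquidity values are made up purely for illustration.

# End-to-end driver (illustrative Python sketch)

def verify_inclusion(value, mpt_path):
    # stub: a real implementation would check the MPT proof against a state root
    pass

def map_step(liquidity_chunk, mpt_path_chunk):
    total = 0
    for value, path in zip(liquidity_chunk, mpt_path_chunk):
        verify_inclusion(value, path)
        total += value
    return ("average_liquidity", (total, len(liquidity_chunk)))

def reduce_step(mapped_values):
    total, count = 0, 0
    for chunk_sum, chunk_count in mapped_values:
        total += chunk_sum
        count += chunk_count
    return ("average_liquidity", total / count)

# Two chunks of per-block liquidity with their (stubbed) MPT paths.
chunks = [([100, 120], ["path0", "path1"]),
          ([80, 100, 100], ["path2", "path3", "path4"])]
mapped = [map_step(liq, paths)[1] for liq, paths in chunks]
print(reduce_step(mapped))  # ('average_liquidity', 100.0)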

Efficiency Improvements of ZKMR

While the ZKMR approach may seem more complicated than traditional sequential computation, its performance scales horizontally with respect to the number of parallel machines, rather than vertically with respect to time.

The time complexity of proving this MapReduce procedure is O(log(n)) when run with maximum parallelization, since the per-chunk proofs are generated in parallel and then folded together pairwise, as opposed to the sequential execution, which has a run-time of O(n).
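
As a back-of-the-envelope check (illustrative Python, with an assumed chunk count), pairwise aggregation halves the number of outstanding proofs at every layer, so the number of sequential aggregation rounds grows with log2 of the number of chunks rather than linearly with the number of blocks.

# Aggregation rounds vs. sequential steps (illustrative)
import math

n_chunks = 1_024
sequential_steps = n_chunks                           # O(n): one proof step per chunk, in series
aggregation_rounds = math.ceil(math.log2(n_chunks))   # O(log n): pairwise proof aggregation

print(sequential_steps, aggregation_rounds)  # 1024 10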

As we continue to fragment state storage across scalability solutions such as app chains, app rollups, L3s and alt-L1s, the amount of on-chain data being created is only growing, and becoming ever more fragmented. The question may very soon become: how much data will one have to process to compute a 1-week TWAP across a DEX deployed on 100 different app rollups?

In summary, while sequential computation is highly efficient and expressive for general-purpose application development, it is poorly optimized for analyzing and processing large data sets. The efficiency of distributed computing has made it the go-to standard for much of the big data processing that dominates Web2. In a zero-knowledge context, scalability and fault tolerance make it the ideal backbone for handling trustless big data applications, such as complex on-chain pricing, volatility and liquidity analyses.
