paper "Parallel Analysis with Sawzall":

  "If scaling were perfect, performance would be proportional to the number of machines... In our test, the effect is to contribute 0.98 machines."

Translation: not 100% linear, but 98% of linear.

From the paper (excerpt): "...or C++, while capable of handling such tasks, are more awkward to use and require more effort on the part of the programmer. Still, Awk and Python are not panaceas; for instance, they have no inherent facilities for processing data on multiple machines. Since the data records we wish to process do live on many machines, it would be fruitful to exploit the combined computing power to perform these analyses. In particular, if the individual steps can be expressed as query operations that can be evaluated one record at a time, we can distribute the calculation across all the machines and achieve very high throughput. The results of these operations will then require an aggregation phase. For example, if we are counting records, we need to gather the counts from the individual machines before we can report the total count.

We therefore break our calculations into two phases. The first phase evaluates the analysis on each record individually, while the second phase aggregates the results (Figure 2). The system described in this paper goes even further, however. The analysis in the first phase is expressed in a new procedural programming language that executes one record at a time, in isolation, to calculate query results for each record. The second phase is restricted to a set of predefined aggregators that process the intermediate results generated by the first phase. By restricting the calculations to this model, we can achieve very high throughput. Although not all calculations fit this model well, the ability to harness a thousand or more machines with a few lines of code provides some compensation.

[Figure 2 image not recoverable from the extraction]
Figure 2: The overall flow of filtering, aggregating, and collating.
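The two-phase model the excerpt describes (a per-record filter phase followed by a predefined aggregator) can be sketched in a few lines. This is a minimal illustration of the idea, not Sawzall's actual language or API; the record shape, the function names, and the count-per-key aggregator are all assumptions made for the example.

```python
# Sketch of the two-phase model: phase 1 runs on each record in isolation
# and emits (key, value) pairs; phase 2 is a predefined aggregator that
# collates the intermediate results. Names are illustrative, not Sawzall's.
from collections import Counter

def filter_phase(record):
    # Phase 1: examine one record at a time, in isolation.
    # Emitting (topic, 1) lets the aggregator count records per topic.
    return [(record["topic"], 1)]

def aggregate_phase(emitted):
    # Phase 2: a fixed aggregator (sum per key) gathers the counts
    # produced independently on each machine into a total.
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals

records = [{"topic": "a"}, {"topic": "b"}, {"topic": "a"}]
emitted = [pair for r in records for pair in filter_phase(r)]
print(aggregate_phase(emitted))  # per-topic record counts
```

Because phase 1 touches one record at a time, it can run on every machine holding data in parallel; only the small intermediate stream flows into phase 2, which is the point of the filter-then-aggregate restriction.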
Each stage typically involves less data than the previous.

Of course, there are still many subproblems that remain to be solved. The calculation must be divided into pieces and distributed across the machines holding the data, keeping the computation as near the data as possible to avoid network bottlenecks. And when there are many machines there is a high probability of some of them failing during the analysis, so the system must be..."

Theo Schlossnagle: "Linear scaling is simply a falsehood" (p. 71)

Scalability is a function
  - Not a number
  - Always limits, e.g., on throughput capacity
  - Want to quantify such limits

(c) 2010 Performance Dynamics    Quantifying Scalability FTW    October 1, 2010    10 / 45
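The "contribute 0.98 machines" quote reduces scaling to a single number, which is exactly what the slide pushes back on: scalability is a function of the machine count, not a constant. A minimal sketch of what the quote's linear-fraction reading implies; the function name and the idea of evaluating it at several cluster sizes are illustrative assumptions, not a model from the paper.

```python
# Reading "each machine contributes 0.98 machines" as 98%-of-linear
# scaling: effective capacity is 0.98 * n for n machines. Treating this
# as a *function* of n (rather than quoting "0.98" as a number) makes
# the cost visible at scale.
def linear_fraction_speedup(n, frac=0.98):
    # Effective machines under a constant per-machine contribution.
    return frac * n

for n in (10, 100, 1000):
    lost = n - linear_fraction_speedup(n)
    print(f"{n} machines -> {linear_fraction_speedup(n):.1f} effective "
          f"({lost:.1f} lost)")
```

Even a constant 98%-of-linear model forfeits 20 machines out of 1000; real systems typically do worse than a constant fraction as n grows, which is why the slide insists on quantifying the limits rather than citing one ratio.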