Parallelism by design: data analysis with sawzall

Very large data sets - telephone call records, network logs, high-resolution satellite images, or web document repositories - are not easily analyzed using traditional database techniques. They may be simply too large, grow too fast, or may not fit well in a database schema. They tend to span multiple disks and machines. On the other hand, these large data sets often have a flat and regular structure that permits distributed filtering and aggregation. We present a system and language for such analyses*. Altering phase, in which a query is expressed using the procedural programming language Sawzall, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The language constructs and execution model of Sawzall have been devised to enable parallel execution without the need for complex dependency analysis. Even with our fairly traditional implementation of the Sawzall execution engine we observe nearly perfect scalability as we add more machines. *Joint work with Sean Dorward, Rob Pike, and Sean Quinlan.