Paper information - Interpreting the data: Parallel analysis with Sawzall

Interpreting the data: Parallel analysis with Sawzall

Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
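The two-phase design described above can be sketched in a few lines. This is a minimal illustration in Python, not Sawzall and not the authors' implementation: the record format, the table names, and the functions `filter_phase`, `aggregate`, and `run` are all hypothetical, and a local thread pool stands in for the many machines the abstract mentions. The aggregator used here is a plain sum, chosen because its result does not depend on the order in which emitted values arrive.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def filter_phase(record):
    """Hypothetical per-record query: look at one record, emit (table, key, value) tuples."""
    host, status, nbytes = record.split()
    emitted = [("count", "all", 1), ("bytes_by_host", host, int(nbytes))]
    if status.startswith("5"):
        # Only server errors are emitted to this table; other records are filtered out.
        emitted.append(("errors_by_host", host, 1))
    return emitted


def aggregate(emitted_streams):
    """Sum aggregator: merge emitted values per table and key, in any arrival order."""
    tables = defaultdict(lambda: defaultdict(int))
    for stream in emitted_streams:
        for table, key, value in stream:
            tables[table][key] += value
    return {name: dict(rows) for name, rows in tables.items()}


def run(records, workers=4):
    """Run the filter phase record by record in a pool, then aggregate the emissions."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        emitted = pool.map(filter_phase, records)
        return aggregate(emitted)


# Made-up log lines in an assumed "host status bytes" format.
LOG = [
    "a.example 200 512",
    "b.example 500 128",
    "a.example 503 64",
]
result = run(LOG)
```

Because each record is examined in isolation and the aggregator only sums, the records in this sketch could be split across workers in any way without changing `result`.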

[1] Jon Louis Bentley, et al. Programming pearls: little languages, 1986, CACM.

[2] Jon Louis Bentley, et al. Programming pearls, 1987, CACM.

[3] William F. Clocksin, et al. Programming in Prolog, 1987, Springer Berlin Heidelberg.

[4] Alfred V. Aho, et al. The awk programming language, 1988.

[5] David R. Hanson. Fast allocation and deallocation of memory based on object lifetimes, 1990, Softw. Pract. Exp.

[6] David M. Beazley, et al. Python Essential Reference, 1999.

[7] Anne Rogers, et al. Hancock: a language for extracting signatures from data streams, 2000, KDD '00.

[8] Sanjeev Khanna, et al. Space-efficient online computation of quantile summaries, 2001, SIGMOD '01.

[9] Michael Stonebraker, et al. Monitoring Streams - A New Class of Data Management Applications, 2002, VLDB.

[10] Sanjay Ghemawat, et al. The Google file system, 2003.

[11] Moses Charikar, et al. Finding frequent items in data streams, 2004, Theor. Comput. Sci.

[12] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[13] Daniel M. Roy, et al. Enhancing Server Availability and Security Through Failure-Oblivious Computing, 2004, OSDI.

[14] Pat Hanrahan, et al. Brook for GPUs: stream computing on graphics hardware, 2004, SIGGRAPH 2004.

[15] Douglas Thain, et al. Distributed computing in practice: the Condor experience, 2005, Concurr. Pract. Exp.

[16] Rob Pike, et al. Interpreting the data, 2005.