Tenzing a SQL implementation on the MapReduce framework

Tenzing is a query engine built on top of MapReduce [9] for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries.

[1]  B. E. Eckbo,et al.  Appendix , 1826, Epilepsy Research.

[2]  GhemawatSanjay,et al.  The Google file system , 2003 .

[3]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[4]  Rob Pike,et al.  Interpreting the data: Parallel analysis with Sawzall , 2005, Sci. Program..

[5]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[6]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[7]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[8]  Jingren Zhou,et al.  SCOPE: easy and efficient parallel processing of massive data sets , 2008, Proc. VLDB Endow..

[9]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[10]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[11]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[12]  Abraham Silberschatz,et al.  HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads , 2009, Proc. VLDB Endow..

[13]  John Cieslewicz,et al.  SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions , 2009, Proc. VLDB Endow..

[14]  Craig Chambers,et al.  FlumeJava: easy, efficient data-parallel pipelines , 2010, PLDI '10.

[15]  Jeffrey D. Ullman,et al.  Optimizing joins in a map-reduce environment , 2010, EDBT '10.

[16]  Dominic Battré,et al.  Nephele/PACTs: a programming model and execution framework for web-scale analytical processing , 2010, SoCC '10.

[17]  Andrey Gubarev,et al.  Dremel : Interactive Analysis of Web-Scale Datasets , 2011 .

[18]  Sanjay Ghemawat,et al.  MapReduce: a flexible data processing tool , 2010, CACM.

[19]  Michael Stonebraker,et al.  MapReduce and parallel DBMSs: friends or foes? , 2010, CACM.

[20]  Jignesh M. Patel,et al.  A comparison of join algorithms for log processing in MaPreduce , 2010, SIGMOD Conference.

[21]  D. DeWitt MapReduce: A major step backwards | The Database Column , 2011 .