HAWQ: a massively parallel processing SQL engine in hadoop

HAWQ, developed at Pivotal, is a massively parallel processing SQL engine sitting on top of HDFS. As a hybrid of MPP database and Hadoop, it inherits the merits from both parties. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is standard SQL compliant, and unlike other SQL engines on Hadoop, it is fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on UDP protocol, transaction management, fault tolerance, read optimized storage, the extensible framework for supporting various popular Hadoop based data stores and formats, and various optimization choices we considered to enhance the query performance. The extensive performance study shows that HAWQ is about 40x faster than Stinger, which is reported 35x-45x faster than the original Hive.

[1]  Ian Rae,et al.  F1: A Distributed SQL Database That Scales , 2013, Proc. VLDB Endow..

[2]  Yu Li,et al.  Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse , 2011, SIGMOD '11.

[3]  David J. DeWitt,et al.  Weaving Relations for Cache Performance , 2001, VLDB.

[4]  Zheng Shao,et al.  Hive - a petabyte scale data warehouse using Hadoop , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[5]  Daniel J. Abadi,et al.  Column oriented Database Systems , 2009, Proc. VLDB Endow..

[6]  Kun Li,et al.  The MADlib Analytics Library or MAD Skills, the SQL , 2012, Proc. VLDB Endow..

[7]  Jim Gray,et al.  A critique of ANSI SQL isolation levels , 1995, SIGMOD '95.

[8]  Joseph M. Hellerstein,et al.  MAD Skills: New Analysis Practices for Big Data , 2009, Proc. VLDB Endow..

[9]  Scott Shenker,et al.  Shark: SQL and rich analytics at scale , 2012, SIGMOD '13.

[10]  Liang Lin,et al.  Tenzing a SQL implementation on the MapReduce framework , 2011, Proc. VLDB Endow..

[11]  Andrey Gubarev,et al.  Dremel : Interactive Analysis of Web-Scale Datasets , 2011 .

[12]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[13]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[14]  GraefeGoetz Encapsulation of parallelism in the Volcano query processing system , 1990 .

[15]  Christopher Frost,et al.  Spanner: Google's Globally-Distributed Database , 2012, OSDI.

[16]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[17]  Goetz Graefe,et al.  Encapsulation of parallelism in the Volcano query processing system , 1990, SIGMOD '90.

[18]  Garret Swart,et al.  Oracle in-database hadoop: when mapreduce meets RDBMS , 2012, SIGMOD Conference.

[19]  Abraham Silberschatz,et al.  HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads , 2009, Proc. VLDB Endow..

[20]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[21]  Craig Chambers,et al.  FlumeJava: easy, efficient data-parallel pipelines , 2010, PLDI '10.

[22]  Donovan A. Schneider,et al.  The Gamma Database Machine Project , 1990, IEEE Trans. Knowl. Data Eng..

[23]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[24]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[25]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[26]  Abraham Silberschatz,et al.  Efficient processing of data warehousing queries in a split execution environment , 2011, SIGMOD '11.

[27]  David J. DeWitt,et al.  Split query processing in polybase , 2013, SIGMOD '13.

[28]  Dan R. K. Ports,et al.  Serializable Snapshot Isolation in PostgreSQL , 2012, Proc. VLDB Endow..

[29]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.