Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from a batch tool to a fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.
