Paper Information - Octopus: Hybrid Big Data Integration Engine

Octopus: Hybrid Big Data Integration Engine

Large enterprises today maintain huge amounts of data in multiple backend systems, including traditional database systems and the more recently popular big data systems. Telecom providers, for example, keep key business data (e.g., billing information) in database systems, whereas their huge signaling logs reside on HDFS with Hive. Integrating such data and providing consolidated querying and analytics over it is a challenging task. Neither traditional data warehouses nor recent big data systems (e.g., Apache Spark and Hadoop) can fully leverage the power of each backend system. In this paper, we build a hybrid data processing engine, called Octopus, to fully integrate the backend systems. Because the data is distributed across the backend systems at multiple locations, Octopus focuses on optimizing the amount of data movement, and to this end it proposes a query pushdown technique. A proof-of-concept prototype verifies that Octopus can achieve much faster running times than Spark. For example, Octopus processes an aggregation query 5.25x faster than Spark 1.4.0.
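
The abstract does not spell out how query pushdown works, so the following is only a minimal sketch of the general idea, not Octopus's implementation. It assumes an in-memory SQLite database as a stand-in for a backend system, and the `billing` table, its columns, and the function names are all hypothetical. It compares how many rows leave the backend when an aggregation is computed in the engine with how many leave when the aggregation is pushed down to the backend.

```python
import sqlite3


def make_backend():
    """Create a stand-in backend holding hypothetical billing data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE billing (user_id INTEGER, amount REAL)")
    rows = [(i % 10, float(i)) for i in range(1000)]
    conn.executemany("INSERT INTO billing VALUES (?, ?)", rows)
    return conn


def aggregate_without_pushdown(conn):
    """Fetch every row from the backend and aggregate on the engine side."""
    fetched = conn.execute("SELECT user_id, amount FROM billing").fetchall()
    totals = {}
    for user_id, amount in fetched:
        totals[user_id] = totals.get(user_id, 0.0) + amount
    return totals, len(fetched)  # len(fetched) = rows moved out of the backend


def aggregate_with_pushdown(conn):
    """Let the backend run the aggregation and fetch only the result rows."""
    fetched = conn.execute(
        "SELECT user_id, SUM(amount) FROM billing GROUP BY user_id"
    ).fetchall()
    return dict(fetched), len(fetched)


backend = make_backend()
naive_totals, naive_rows = aggregate_without_pushdown(backend)
pushed_totals, pushed_rows = aggregate_with_pushdown(backend)
print(f"rows moved without pushdown: {naive_rows}")
print(f"rows moved with pushdown:    {pushed_rows}")
```

Both functions return the same per-user totals; the difference is only in how much data crosses the boundary between the backend and the engine, which is the quantity the abstract says Octopus optimizes.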

Hong Min | Weixiong Rao | Chenyang Xu | Yanjie Chen | Gong Su

[1] Scott Shenker,et al. Shark: SQL and rich analytics at scale , 2012, SIGMOD '13.

[2] Hakan Hacigümüs,et al. MISO: souping up big data query processing with a multistore system , 2014, SIGMOD Conference.

[3] Fei Wang,et al. OceanST: A Distributed Analytic System for Large-Scale Spatiotemporal Mobile Broadband Data , 2014, Proc. VLDB Endow..

[4] Ioana Manolescu,et al. Invisible Glue: Scalable Self-Tuning Multi-Stores , 2015, CIDR.

[5] Joseph K. Bradley,et al. Spark SQL: Relational Data Processing in Spark , 2015, SIGMOD Conference.

[6] Matei A. Zaharia,et al. An Architecture for Fast and General Data Processing on Large Clusters , 2016 .

[7] Alon Y. Halevy,et al. Principles of Data Integration , 2012 .

[8] Steven Hand,et al. Musketeer: all for one, one for all in data processing systems , 2015, EuroSys.

[9] Ameet Talwalkar,et al. MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[10] Dorit S. Hochbaum,et al. Approximation Algorithms for NP-Hard Problems , 1997, SIGA.