PipeGen: Data Pipe Generator for Hybrid Analytics

As the number of big data management systems continues to grow, users increasingly seek to leverage multiple systems in the context of a single data analysis task. To support such hybrid analytics, we develop PipeGen, a tool for efficient data transfer between database management systems (DBMSs). PipeGen automatically generates data pipes between DBMSs by leveraging their existing functionality for transferring data via disk files in common formats such as CSV. It extends this functionality with efficient binary data transfer capabilities that avoid file system materialization, include multiple important format optimizations, and transfer data in parallel when possible. We evaluate our PipeGen prototype by automatically generating 20 data pipes between five different DBMSs. The results show that PipeGen speeds up data transfer by up to 3.8× compared to transfer via disk files.
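
To make the idea of a data pipe concrete, the following is a minimal illustrative sketch, not PipeGen's actual implementation. It assumes a hypothetical fixed two-column schema (a 64-bit integer and a 64-bit float), and the names `ROW`, `export_rows`, `import_rows`, and `transfer` are invented for illustration. Instead of writing CSV text to a file and reading it back, the source streams binary-encoded rows over a socket to the destination, so nothing is materialized on disk.

```python
import socket
import struct
import threading

# Hypothetical fixed schema for illustration: (int64 id, float64 value),
# encoded in network byte order. A real pipe would derive this from the data.
ROW = struct.Struct("!qd")


def export_rows(rows, sock):
    """Source side: send rows in binary form instead of writing CSV to disk."""
    with sock, sock.makefile("wb") as out:
        for row_id, value in rows:
            out.write(ROW.pack(row_id, value))


def import_rows(sock):
    """Destination side: read fixed-width binary rows until the sender closes."""
    received = []
    with sock, sock.makefile("rb") as inp:
        while True:
            chunk = inp.read(ROW.size)
            if len(chunk) < ROW.size:
                break  # end of stream
            received.append(ROW.unpack(chunk))
    return received


def transfer(rows):
    """Connect an exporter and an importer through an in-memory socket pair."""
    src, dst = socket.socketpair()
    sender = threading.Thread(target=export_rows, args=(rows, src))
    sender.start()
    result = import_rows(dst)
    sender.join()
    return result
```

Calling `transfer([(1, 2.5), (2, -0.25)])` moves both rows from the exporter to the importer without any text encoding or intermediate file. A connected socket pair stands in here for the network connection between two systems; parallel transfer would, under the same assumptions, amount to running several such pipes at once.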
