Incremental Consolidation of Data-Intensive Multi-Flows

Business intelligence (BI) systems depend on the efficient integration of disparate and often heterogeneous data. This integration is governed by data-intensive flows and driven by a set of information requirements. Designing such flows is generally a complex process that, given the complexity of business environments, is hard to do manually. In this paper, we address the challenge of efficiently designing and maintaining data-intensive flows and propose CoAl, an incremental approach for semi-automatically consolidating data-intensive flows that satisfy a given set of information requirements. CoAl works at the logical level and consolidates data flows derived from either high-level information requirements or platform-specific programs. As CoAl integrates a new data flow, it opts for maximal reuse of existing flows and applies a customizable cost model tuned to minimize the overall cost of the unified solution. We demonstrate the efficiency and effectiveness of our approach through an experimental evaluation using our implemented prototype.
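To make the idea of incremental consolidation with reuse more concrete, the following is a minimal illustrative sketch, not CoAl's actual algorithm. It assumes that each flow is a linear chain of logical operations and that two operations can be shared only when they match exactly at the same position from the source. The names `MultiFlow`, `integrate`, `total_cost`, and `unit_costs`, as well as the cost values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A logical operation: (operator kind, parameters). Hypothetical encoding.
Op = Tuple[str, str]


@dataclass
class Node:
    op: Op
    children: Dict[Op, "Node"] = field(default_factory=dict)


class MultiFlow:
    """Unified flow stored as a prefix tree of operations (an assumption)."""

    def __init__(self, cost: Callable[[Op], float]):
        self.roots: Dict[Op, Node] = {}
        self.cost = cost  # pluggable cost model: operation -> estimated cost

    def integrate(self, flow: List[Op]) -> int:
        """Add one linear flow and return how many operations were reused."""
        level = self.roots
        reused = 0
        for op in flow:
            if op in level:
                reused += 1            # share the existing operation
            else:
                level[op] = Node(op)   # branch off with a new operation
            level = level[op].children
        return reused

    def total_cost(self) -> float:
        """Sum the cost of every distinct operation in the unified flow."""
        total, stack = 0.0, list(self.roots.values())
        while stack:
            node = stack.pop()
            total += self.cost(node.op)
            stack.extend(node.children.values())
        return total


# Hypothetical per-operator costs and two flows sharing a common prefix.
unit_costs = {"extract": 10.0, "filter": 2.0, "aggregate": 5.0}
flow_a = [("extract", "sales"), ("filter", "year=2015"),
          ("aggregate", "sum(amount) by region")]
flow_b = [("extract", "sales"), ("filter", "year=2015"),
          ("aggregate", "avg(amount) by product")]

mf = MultiFlow(cost=lambda op: unit_costs[op[0]])
mf.integrate(flow_a)
reused = mf.integrate(flow_b)
print(reused, mf.total_cost())
```

In this toy setting, the second flow reuses the shared extraction and filter steps, so the unified flow costs less than running the two flows separately. A realistic consolidation would operate on general flow graphs and a richer cost model, which this sketch deliberately leaves out.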
