A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey

Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus need a clear understanding of the foundations of data-intensive flows and of the challenges of moving towards next generation BI environments. In this paper we present a survey of today's research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing the challenges that are still to be addressed and how current solutions can be applied to address them.
