On-Demand ELT Architecture for Right-Time BI: Extending the Vision

In a typical BI infrastructure, data extracted from operational sources is transformed, cleansed, and loaded into a data warehouse by a periodic ETL process. This process usually runs nightly, i.e., a full day's worth of data is processed and loaded during off-hours. Business insight, however, benefits from fresher data, ideally available in near real time. To this end, the authors propose to leverage the data warehouse's capability to import raw, unprocessed records directly and to defer transformation and data cleaning until pending reports need them. At that point, the database's own processing mechanisms can be used to process the data on demand. Event-processing capabilities are integrated into the proposed architecture. Besides outlining the overall architecture, the authors also present a roadmap for implementing a complete prototype using conventional database technology in the form of hierarchical materialized views.
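The load-first, transform-on-demand idea described above can be illustrated with a minimal sketch. It is not the authors' prototype: it uses Python's built-in sqlite3 module, and because SQLite has no materialized views, a plain table (`clean_sales`) refreshed incrementally stands in for one. All table, column, and function names here are hypothetical and chosen only for illustration.

```python
import sqlite3


def make_warehouse():
    """Create an in-memory 'warehouse' (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Raw records are stored exactly as they arrive (amount kept as text).
        CREATE TABLE raw_sales (id INTEGER PRIMARY KEY, region TEXT, amount TEXT);
        -- Stand-in for a materialized view holding the cleaned data.
        CREATE TABLE clean_sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
        -- High-water mark: the last raw id already transformed.
        CREATE TABLE load_state (last_id INTEGER NOT NULL);
        INSERT INTO load_state VALUES (0);
    """)
    return conn


def load_raw(conn, rows):
    """Extract + Load: import records with no transformation or cleaning."""
    conn.executemany("INSERT INTO raw_sales (region, amount) VALUES (?, ?)", rows)
    conn.commit()


def refresh_on_demand(conn):
    """Transform: clean only the raw rows that arrived since the last refresh."""
    (last_id,) = conn.execute("SELECT last_id FROM load_state").fetchone()
    conn.execute(
        """
        INSERT INTO clean_sales (id, region, amount)
        SELECT id, UPPER(TRIM(region)), CAST(amount AS REAL)
        FROM raw_sales
        WHERE id > ? AND region IS NOT NULL AND TRIM(amount) <> ''
        """,
        (last_id,),
    )
    conn.execute(
        "UPDATE load_state SET last_id = (SELECT COALESCE(MAX(id), 0) FROM raw_sales)"
    )
    conn.commit()


def report_sales_by_region(conn):
    """A pending report triggers the deferred transformation, then queries."""
    refresh_on_demand(conn)
    return conn.execute(
        "SELECT region, SUM(amount) FROM clean_sales GROUP BY region ORDER BY region"
    ).fetchall()


if __name__ == "__main__":
    conn = make_warehouse()
    load_raw(conn, [(" east ", "10.5"), ("west", "4"), ("East", ""), ("EAST", "2")])
    print(report_sales_by_region(conn))
```

In this sketch, loading is cheap because nothing is cleaned at ingest time, and the cost of transformation is paid only when a report asks for the data. A database with native materialized views could replace the manual high-water-mark bookkeeping with its own refresh mechanism.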
