The challenges of extract, transform and load (ETL) for data integration in near real-time environment

For organizations with considerable investment in data warehousing, the influx of varied data types and forms requires ways of preparing data and a staging platform that support fast, efficient delivery of volatile data to its target audiences or users with different business needs. Extract, Transform and Load (ETL) systems have proved to be a standard choice for managing and sustaining the movement and transactional processing of valued big data assets. However, traditional ETL systems can no longer accommodate and effectively handle streaming or near real-time data in a demanding environment that requires high availability, low latency and horizontal scalability. This paper identifies the challenges of implementing ETL systems for streaming or near real-time data, which need to evolve and streamline themselves to meet these different requirements. Current efforts and solution approaches to address the challenges are presented. The ETL system challenges are classified by near real-time environment features and by ETL stages to encourage different perspectives for future research.
