Performance of the ETL processes in terms of volume and velocity in the cloud: State of the art

ETL (Extract-Transform-Load) consists of extracting data from various sources, transforming them, and loading them into a repository called a data warehouse. ETL is a mandatory step in projects that implement decision-support information systems or knowledge management systems within organizations, but it is also a long step that is costly in human and IT resources. Moreover, in the context of big data, characterized by the 4 Vs (Variety, Velocity, Volume and Veracity), processing speed has become a decisive factor in the search for competitiveness. To ease the implementation of ETL, one solution is to use cloud computing infrastructures, whose computation and storage resources are virtually unlimited. This has brought considerable progress in availability and scalability, which helps projects succeed. A major problem remains, however: under the cloud's “pay-per-use” model, the cost can quickly become prohibitive. How, then, can cloud-based ETL solutions be built at a lower cost? Many proposals have been made. In this article, we review these works, highlighting the performance aspects of data processing in terms of volume and velocity.
