Paper information - Performance of the ETL processes in terms of volume and velocity in the cloud: State of the art

ETL (Extract-Transform-Load) consists of extracting data from various sources, transforming it, and loading it into a repository called a data warehouse. ETL is a mandatory step in projects that implement decision-support information systems or knowledge management systems within organizations, but it is also a long step that is costly in human and IT resources. In the context of big data, characterized by the 4 Vs (Variety, Velocity, Volume and Veracity), processing speed has become a decisive factor in the search for competitiveness. To facilitate the implementation of ETL, one solution is to use cloud computing infrastructures, whose computation and storage resources are virtually unlimited. This has brought considerable progress in availability and scalability, which contributes to the success of projects. A major problem remains, however: under the cloud's “pay-per-use” model, the cost can quickly become prohibitive. How, then, can cloud-based ETL solutions be built at a lower cost? Many proposals have been made. In this article, we review these works, highlighting the performance of data processing in terms of volume and velocity.

Papa Senghane Diouf | Aliou Boly | Samba Ndiaye