MapReduce-Based Parallel Algorithms for Multidimensional Data Analysis

MapReduce offers excellent scalability and fault tolerance, and it is well suited to cheap commodity hardware. Using MapReduce to answer analytical queries is therefore an attractive topic. In this work, we study the processing of Multiple Group-by queries. Our processing is based on the MapReduce model, a new parallel computing model that comes from cloud computing. A pre-processing phase adapts the data to MapReduce's data access pattern and improves data accessibility. We give different MapReduce job definitions for processing data sets partitioned with different partitioning methods. We evaluate the query processing on a cluster of Grid'5000. We also address performance issues, since performance matters greatly when the software industry integrates a new technology. We analyze the measured results and identify several factors that affect the response time. Finally, we propose a new data structure that allows more flexible job scheduling.
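To make the notion of a Multiple Group-by query concrete, the following is a minimal single-process sketch of how such a query could be expressed as map and reduce functions. It is not the job definition used in this work: the function names (`map_record`, `shuffle`, `reduce_group`, `multiple_group_by`), the dictionary-based record layout, and the choice of SUM as the aggregate are all assumptions made for illustration, and the shuffle step is simulated in memory rather than run on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: one dict per row of the fact table.
Record = Dict[str, object]
# Intermediate key: (group-by dimension name, value of that dimension).
Key = Tuple[str, object]


def map_record(record: Record, dimensions: List[str],
               measure: str) -> Iterator[Tuple[Key, float]]:
    """Map phase: emit one (key, measure) pair per group-by dimension."""
    for dim in dimensions:
        yield (dim, record[dim]), float(record[measure])


def shuffle(pairs: Iterable[Tuple[Key, float]]) -> Dict[Key, List[float]]:
    """Simulated shuffle: collect all values that share the same key."""
    groups: Dict[Key, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key: Key, values: List[float]) -> Tuple[Key, float]:
    """Reduce phase: aggregate the values of one group (SUM assumed here)."""
    return key, sum(values)


def multiple_group_by(records: Iterable[Record], dimensions: List[str],
                      measure: str) -> Dict[Key, float]:
    """Answer several group-by queries over the same data in one pass."""
    pairs = (pair
             for record in records
             for pair in map_record(record, dimensions, measure))
    return dict(reduce_group(k, v) for k, v in shuffle(pairs).items())
```

Calling `multiple_group_by(rows, ["color", "region"], "sales")` on a list of row dictionaries would return one aggregate per (dimension, value) pair, so the group-by on `color` and the group-by on `region` share a single scan of the data. Tagging each intermediate key with its dimension name is one simple way to keep the individual group-by results separate in the reducer output.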
