Paper information - Parallelizing multiple group-by query in share-nothing environment: a MapReduce study case

MapReduce offers excellent scalability and fault tolerance. It fits well with today's dominant distributed architectures, such as clusters and Grids, which are usually shared-nothing computing environments. However, using MapReduce for data analysis applications still poses challenges, because MapReduce is a low-level procedural programming paradigm that does not directly support relational algebraic operators. In this work, we address a typical data analytic query, the multiple group-by query. We parallelized the calculations involved in this type of query with MapReduce, and we introduced indexing and data partitioning into our implementation. We measured the speedup of implementations over both horizontally partitioned data and vertically partitioned data. We analysed the factors affecting performance through both measurement and formal estimation.
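
To make the query type concrete, the following is a minimal single-process sketch of a multiple group-by query written in a map/reduce style over horizontally partitioned data. It is not the authors' implementation: the toy records, the column names (`region`, `product`, `sales`), the choice of SUM as the aggregate, and the helper names `map_phase`, `reduce_phase`, and `multiple_group_by` are all assumptions made for illustration, and the indexing and vertical partitioning mentioned in the abstract are not modelled.

```python
from collections import defaultdict

# Hypothetical toy records; the column names are illustrative only.
RECORDS = [
    {"region": "EU", "product": "A", "sales": 10},
    {"region": "EU", "product": "B", "sales": 5},
    {"region": "US", "product": "A", "sales": 7},
    {"region": "US", "product": "B", "sales": 3},
]

# Several group-by columns aggregated over the same selection of rows.
GROUP_BY_COLUMNS = ["region", "product"]


def map_phase(partition):
    """Emit one ((column, value), measure) pair per record and group-by column."""
    for record in partition:
        for column in GROUP_BY_COLUMNS:
            yield (column, record[column]), record["sales"]


def reduce_phase(pairs):
    """Sum the measure for every (column, value) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(records, n_partitions=2):
    # Horizontal partitioning (assumed round-robin): each partition holds whole rows.
    partitions = [records[i::n_partitions] for i in range(n_partitions)]
    # Each partition is mapped and locally reduced on its own, as a worker
    # in a shared-nothing setting would do without seeing other partitions.
    partials = [reduce_phase(map_phase(p)) for p in partitions]
    # The final reduce merges the partial aggregates from all partitions.
    merged_pairs = (pair for partial in partials for pair in partial.items())
    return reduce_phase(merged_pairs)


RESULT = multiple_group_by(RECORDS)
```

Because SUM is associative and commutative, the partial aggregates computed per partition can be merged in any order, which is what lets the work be spread across independent workers in this sketch.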

Jie Pan | Frédéric Magoulès | Yann Le Biannic