Parallelizing multiple group-by query in share-nothing environment: a MapReduce study case

MapReduce offers excellent scalability and fault tolerance. It fits well with today's dominant distributed architectures, such as clusters and Grids, which are usually shared-nothing computing environments. However, using MapReduce for data analysis applications still poses challenges, because MapReduce is a low-level procedural programming paradigm that does not directly support relational algebraic operators. In this work, we addressed a typical data analysis query, the multiple group-by query. We parallelized the calculations involved in this type of query with MapReduce, and we introduced indexing and data partitioning into our implementations. We measured the speedup of implementations over both horizontally partitioned and vertically partitioned data. We analysed the factors affecting performance through both measurement and formal estimation.
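
To make the query type concrete, the following is a minimal single-process sketch of how a multiple group-by query can be expressed as a map phase and a reduce phase over horizontally partitioned data. It is an illustration under stated assumptions, not the implementation evaluated in this work: the records, the column names (region, product, sales), the choice of SUM as the aggregate, and the function names are all hypothetical, and indexing and vertical partitioning are omitted.

```python
from collections import defaultdict

# Hypothetical toy records; the column names are illustrative only.
RECORDS = [
    {"region": "EU", "product": "A", "sales": 10},
    {"region": "EU", "product": "B", "sales": 5},
    {"region": "US", "product": "A", "sales": 7},
    {"region": "US", "product": "B", "sales": 3},
]

# One aggregate per column: GROUP BY region, and separately GROUP BY product.
GROUP_BY_COLUMNS = ["region", "product"]


def map_record(record, group_by_columns, measure):
    """Emit one ((column, value), measure) pair per group-by column."""
    for column in group_by_columns:
        yield (column, record[column]), record[measure]


def reduce_pairs(pairs):
    """Sum the measure for every distinct (column, value) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(partitions, group_by_columns, measure):
    """Run the map phase over each partition, then reduce by key."""
    intermediate = []
    # Map phase: each horizontal partition can be processed independently,
    # which is what a shared-nothing deployment would run in parallel.
    for partition in partitions:
        for record in partition:
            intermediate.extend(map_record(record, group_by_columns, measure))
    # Reduce phase: aggregate all intermediate pairs that share a key.
    return reduce_pairs(intermediate)


# Assumed horizontal partitioning: two partitions of two records each.
partitions = [RECORDS[:2], RECORDS[2:]]
result = multiple_group_by(partitions, GROUP_BY_COLUMNS, "sales")
```

In this sketch a single pass over the data serves every group-by column, because the mapper tags each emitted value with the column it belongs to. A real MapReduce job would distribute the partitions across worker nodes and shuffle the intermediate pairs by key before reducing.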