Optimizing MapReduce for GPUs with effective shared memory usage

Accelerators and heterogeneous architectures in general, and GPUs in particular, have recently emerged as major players in high performance computing. For many classes of applications, MapReduce has emerged as a framework for easing parallel programming and improving programmer productivity. There have already been several efforts to implement MapReduce on GPUs. In this paper, we propose a new implementation of MapReduce for GPUs that makes effective use of shared memory, a small programmable cache on modern GPUs. The main idea is to execute a MapReduce application with a reduction-based method: each key-value pair is reduced as soon as it is emitted, rather than being stored for a later reduce phase, which allows reductions to be carried out in shared memory. To keep the implementation general and efficient, we support the following features: a memory hierarchy for maintaining the reduction object, a multi-group scheme in shared memory to trade off space requirements against locking overheads, a general and efficient data structure for the reduction object, and an efficient swapping mechanism. We have evaluated our framework with seven commonly used MapReduce applications and compared it with sequential implementations; with MapCG, a recent MapReduce implementation on GPUs; and with Ji et al.'s work, a recent MapReduce implementation that utilizes shared memory in a different way. The main observations from our experimental results are as follows. For the four of the seven applications that can be considered reduction-intensive, our framework achieves a speedup of between 5 and 200 over MapCG (for large datasets). For the same applications, it achieves a speedup of between 2 and 60 over Ji et al.'s work.
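To make the reduction-based idea concrete, the following is a minimal sequential sketch in Python. It is not the paper's CUDA implementation. It contrasts a conventional store-then-reduce execution with a reduce-on-emit execution that keeps several small reduction objects (standing in for per-group copies in shared memory) and swaps them into one large object (standing in for device memory) when they fill up. All names (`run_conventional`, `run_reduction_based`, `num_groups`, `capacity`) are hypothetical. Assigning records to groups round-robin and flushing the whole small object on overflow are simplifying assumptions, not the paper's policies.

```python
from collections import defaultdict


def map_word_count(line):
    """Hypothetical map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1


def add(a, b):
    """Reduce operation; it must be associative and commutative."""
    return a + b


def run_conventional(records, map_fn, reduce_fn):
    """Baseline: store every intermediate pair, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    result = {}
    for key, values in groups.items():
        acc = values[0]
        for value in values[1:]:
            acc = reduce_fn(acc, value)
        result[key] = acc
    return result


def merge_into(target, source, reduce_fn):
    """Fold every entry of one reduction object into another."""
    for key, value in source.items():
        target[key] = reduce_fn(target[key], value) if key in target else value


def run_reduction_based(records, map_fn, reduce_fn, num_groups=2, capacity=2):
    """Reduce each pair as soon as it is emitted.

    small[g] stands in for one shared-memory copy of the reduction object,
    large for the device-memory copy. capacity is the number of distinct
    keys a small object may hold before it is swapped out (an assumption).
    """
    small = [dict() for _ in range(num_groups)]
    large = {}
    swaps = 0
    for i, record in enumerate(records):
        g = i % num_groups  # stand-in for "the group this thread belongs to"
        obj = small[g]
        for key, value in map_fn(record):
            if key not in obj and len(obj) >= capacity:
                # Small object is full: flush it to the large one and reuse it.
                merge_into(large, obj, reduce_fn)
                obj.clear()
                swaps += 1
            obj[key] = reduce_fn(obj[key], value) if key in obj else value
    # Final merge of whatever is left in the small objects.
    for obj in small:
        merge_into(large, obj, reduce_fn)
    return large, swaps


if __name__ == "__main__":
    lines = ["a b a", "c a b", "d e f g", "a a h"]
    counts, swaps = run_reduction_based(lines, map_word_count, add)
    print(counts, "swaps:", swaps)
```

In this sketch, raising `num_groups` gives each group its own copy of the object. On a GPU that would mean fewer threads contending for the same locks, at the cost of more shared memory, which is the space versus locking trade-off the abstract describes. The intermediate pairs are never stored, so memory use is bounded by the number of distinct keys held at once rather than by the number of emitted pairs.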

[1] Patrick R. Amestoy, et al. High Performance Computing for Computational Science - VECPAR 2008, 2008, Lecture Notes in Computer Science.

[2] Benjamin Rose, et al. CellMR: A framework for supporting mapreduce on asymmetric cell-based clusters, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[3] Uday Bondhugula, et al. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories, 2008, PPoPP.

[4] Christoforos E. Kozyrakis, et al. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[5] Emilio L. Zapata, et al. Memory Locality Exploitation Strategies for FFT on the CUDA Architecture, 2008, VECPAR.

[6] Satoshi Matsuoka, et al. Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters, 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.

[7] Wenguang Chen, et al. MapCG: Writing parallel program portable between CPU and GPU, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[8] Randal E. Bryant, et al. Data-Intensive Supercomputing: The case for DISC, 2007.

[9] Roy H. Campbell, et al. MITHRA: Multiple data independent tasks on a heterogeneous resource architecture, 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[10] Thomas Hofmann, et al. Map-Reduce for Machine Learning on Multicore, 2007.

[11] Majid Sarrafzadeh, et al. A memory optimization technique for software-managed scratchpad memory in GPUs, 2009, 2009 IEEE 7th Symposium on Application Specific Processors.

[12] Wu-chun Feng, et al. StreamMR: An Optimized MapReduce Framework for AMD GPUs, 2011, 2011 IEEE 17th International Conference on Parallel and Distributed Systems.

[13] John D. Owens, et al. Multi-GPU MapReduce on GPU Clusters, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[14] Kurt Keutzer, et al. A map reduce framework for programming graphics processors, 2010.

[15] Christoforos E. Kozyrakis, et al. Evaluating MapReduce for Multi-core and Multiprocessor Systems, 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[16] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[17] Dinesh Manocha, et al. GPUTeraSort: high performance graphics co-processor sorting for large database management, 2006, SIGMOD Conference.

[18] David W. Aha, et al. Instance-Based Learning Algorithms, 1991, Machine Learning.

[19] Naga K. Govindaraju, et al. Mars: A MapReduce Framework on graphics processors, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[20] Anil K. Jain, et al. Algorithms for Clustering Data, 1988.

[21] Arlo Faria, et al. MapReduce: Distributed Computing for Machine Learning, 2006.

[22] Feng Ji, et al. Using Shared Memory to Accelerate MapReduce on Graphics Processing Units, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.