Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system

Dynamic runtimes can simplify parallel programming by automatically managing concurrency and locality without further burdening the programmer. Nevertheless, implementing such runtime systems for large-scale, shared-memory systems can be challenging. This work optimizes Phoenix, a MapReduce runtime for shared-memory multi-cores and multiprocessors, on a quad-chip, 32-core, 256-thread UltraSPARC T2+ system with NUMA characteristics. We show how a multi-layered approach comprising optimizations to the algorithm, the implementation, and the OS interaction leads to significantly better speedups with 256 threads (2.5× higher speedup on average, 19× at most). We also identify the roadblocks that limit the scalability of parallel runtimes on shared-memory systems, which are inherently tied to OS scalability on large-scale systems.
