Software challenges in extreme scale systems

Computer systems anticipated in the 2015-2020 timeframe are referred to as Extreme Scale because they will be built using massive multi-core processors with hundreds of cores per chip. The largest capability Extreme Scale system is expected to deliver Exascale performance of the order of 10^18 operations per second. These systems pose new critical challenges for software in the areas of concurrency, energy efficiency and resiliency. In this paper, we discuss the implications of the concurrency and energy efficiency challenges on future software for Extreme Scale systems. From an application viewpoint, the concurrency and energy challenges boil down to the ability to express and manage parallelism and locality by exploring a range of strong scaling and new-era weak scaling techniques. For expressing parallelism and locality, the key challenges are the ability to expose all of the intrinsic parallelism and locality in a programming model, while ensuring that this expression of parallelism and locality is portable across a range of systems. For managing parallelism and locality, the OS-related challenges include parallel scalability, spatial partitioning of OS and application functionality, direct hardware access for inter-processor communication, and asynchronous rather than interrupt-driven events, which are accompanied by runtime system challenges for scheduling, synchronization, memory management, communication, performance monitoring, and power management. We conclude by discussing the importance of software-hardware co-design in addressing the fundamental challenges for application enablement on Extreme Scale systems.
