Systems research challenges: A scale-out perspective

A scale-out system is a collection of interconnected, modular, low-cost computers that work together as a single entity to provide applications, system resources, and data to users. The dominant programming model for such systems consists of message passing at the system level and multithreading at the element level. Scale-out computers have traditionally been developed and deployed to provide levels of performance (throughput and parallel processing) beyond what was achievable by large shared-memory computers that used the fastest processors and the most expensive memory systems. Today, exploiting scale-out at all levels in systems is becoming imperative to overcome a fundamental discontinuity in the development of microprocessor technology caused by power dissipation. The pervasive use of greater levels of scale-out, on the other hand, creates its own challenges in architecture, programming, systems management, and reliability. This position paper identifies some of the important research problems that must be addressed to deal with this technology disruption and fully realize the opportunity offered by scale-out. Our examples are based on parallelism, but the challenges we identify apply to scale-out more generally.
