The Landscape of Parallel Computing Research: A View from Berkeley

Author(s): Asanovic, K.; Bodik, R.; Catanzaro, B.; Gebis, J.; Husbands, P.; Keutzer, K.; Patterson, D.; Plishker, W.; Shalf, J.; Williams, S. W.

Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met for nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work for 2- to 8-processor systems, but is likely to face diminishing returns as 16- and 32-processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high-performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following:

• The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
• The target should be 1000s of cores per chip, as such chips are built from the processing elements that are the most efficient in MIPS (million instructions per second) per watt, MIPS per area of silicon, and MIPS per development dollar.
• Instead of traditional benchmarks, use 13 “dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.)
• “Autotuners” should play a larger role than conventional compilers in translating parallel programs.
• To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
• To be successful, programming models should be independent of the number of processors.
• To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
• Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.
• Traditional operating systems will be deconstructed, and operating system functionality will be orchestrated using libraries and virtual machines.
• To explore the design space rapidly, use system emulators based on field-programmable gate arrays (FPGAs) that are highly scalable and low cost.

Since real-world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.
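The “autotuner” recommendation above can be illustrated with a minimal sketch: rather than relying on a static compiler to pick the best code, an autotuner generates or collects several candidate implementations of a kernel, times each one on the actual target machine, and keeps the fastest. The kernel and variant functions below are hypothetical stand-ins chosen for brevity, not examples from the paper.

```python
import timeit

# Two candidate implementations of the same kernel (summing a list).
# A real autotuner such as ATLAS or FFTW explores blocking factors,
# loop orders, unrolling, etc.; the principle is the same.

def sum_loop(xs):
    # Straightforward accumulation loop.
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    # Variant using the interpreter's built-in reduction.
    return sum(xs)

def autotune(variants, data, repeats=5):
    """Return the variant with the lowest measured runtime on `data`."""
    def measure(fn):
        # Take the minimum over several repeats to reduce timing noise.
        return min(timeit.repeat(lambda: fn(data), number=10, repeat=repeats))
    return min(variants, key=measure)

data = [float(i) for i in range(10_000)]
best = autotune([sum_loop, sum_builtin], data)
# Every variant must compute the same answer; only speed differs.
assert abs(best(data) - sum_builtin(data)) < 1e-6
```

Because the selection is driven by measurements on the machine at hand, the same tuning harness adapts automatically to different cache sizes, core counts, or interpreters, which is why the authors argue autotuners scale to parallel targets better than one-size-fits-all compiler heuristics.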
