A Principled Kernel Testbed for Hardware/Software Co-Design Research

Alex Kaiser, Samuel Williams, Kamesh Madduri, Khaled Ibrahim, David Bailey, James Demmel, Erich Strohmaier
Computational Research Division, Lawrence Berkeley National Laboratory

Abstract

Recently, advances in processor architecture have become the driving force for new programming models in the computing industry, as ever newer multicore processor designs with increasing numbers of cores are introduced on schedules regimented by marketing demands. As a result, collaborative parallel (rather than simply concurrent) implementations of important applications, programming languages, models, and even algorithms have been forced to adapt to these architectures to exploit the available raw performance. We believe that this optimization regime is flawed. In this paper, we present an alternate approach that, rather than starting with an existing hardware/software solution laced with hidden assumptions, defines the computational problems of interest and invites architects, researchers, and programmers to implement novel hardware/software co-designed solutions. Our work builds on the previous ideas of computational dwarfs, motifs, and parallel patterns by selecting a representative set of essential problems for which we provide: an algorithmic description; a scalable problem definition; illustrative reference implementations; and verification schemes. This testbed will enable comparative research in areas such as parallel programming models, languages, auto-tuning, and hardware/software co-design. For simplicity, we focus initially on the computational problems of interest to the scientific computing community, but we propose the methodology (and perhaps a subset of the problems) as applicable to other communities. We intend to broaden the coverage of this problem space through stronger community involvement.

Introduction

For decades, computer scientists have sought guidance on how to evolve architectures, languages, and programming models in order to improve application performance, efficiency, and productivity. Unfortunately, without an overarching direction, individual guidance is inferred from the existing software/hardware ecosystem, and each group often conducts its research independently, assuming all other technologies remain fixed. Architects attempt to provide micro-architectural solutions that improve performance on fixed binaries. Researchers tweak compilers to improve code generation for existing architectures and implementations, and they may invent new programming models for fixed processor and memory architectures and computational algorithms. In today's rapidly evolving world of on-chip parallelism, these isolated and iterative improvements to performance may miss superior solutions, in the same way that gradient descent optimization techniques may get stuck in local minima.

To combat this tunnel vision, previous work set forth a broad categorization of the numerical methods of interest to the scientific computing community (the seven Dwarfs) and subsequently to the larger parallel computing community in general (the 13 motifs), suggesting that these were the problems of interest that researchers should focus on [1, 2, 9]. Unfortunately, such broad brush strokes often miss the nuance seen in individual kernels that may be similarly categorized.
For example, the computational requirements of particle methods vary greatly between the naive but more accurate direct calculations and the particle-mesh and particle-tree codes.

In this paper, we present an alternate methodology for testbed creation. For simplicity, we restricted our domain to scientific computing. Superficially, this is reminiscent of the computational kernels in Intel's RMS work [12]. However, we proceed in a more regimented effort. We commence with the enumeration of problems and provide not only a reference implementation for each problem but, more importantly, a mathematical definition that allows one to escape iterative approaches to software/hardware optimization. To ensure long-term value, we augment each problem with both a scalable problem generator and a verification scheme. By no means is the set of problems presented here complete; as noted above, we intend to broaden its coverage through stronger community involvement.
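To make the shape of a testbed entry concrete, the C sketch below packages one kernel in that spirit: a deterministic, scalable problem generator; a naive O(n^2) direct particle-particle reference implementation; and an implementation-independent verification check. This is only an illustration of the methodology under our own assumptions, not code drawn from the testbed itself; the choice of a gravitational direct-summation kernel, the softening constant, and names such as generate_problem, direct_forces, and verify are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef struct { double x, y, z, m; } particle;

/* Scalable problem generator: any problem size n, reproducible via the seed. */
static void generate_problem(particle *p, int n, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < n; i++) {
        p[i].x = (double)rand() / RAND_MAX;
        p[i].y = (double)rand() / RAND_MAX;
        p[i].z = (double)rand() / RAND_MAX;
        p[i].m = 1.0 + (double)rand() / RAND_MAX;
    }
}

/* Naive O(n^2) reference implementation: direct particle-particle forces. */
static void direct_forces(const particle *p, double (*f)[3], int n)
{
    const double eps2 = 1e-12;          /* softening term, an illustrative choice */
    for (int i = 0; i < n; i++) {
        f[i][0] = f[i][1] = f[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + eps2;
            double s  = p[i].m * p[j].m / (r2 * sqrt(r2));
            f[i][0] += s * dx;
            f[i][1] += s * dy;
            f[i][2] += s * dz;
        }
    }
}

/* Verification scheme: by Newton's third law the net force over all particles
 * must vanish to roundoff, however an optimized variant orders its sums. */
static int verify(double (*f)[3], int n, double tol)
{
    double s[3] = { 0.0, 0.0, 0.0 };
    for (int i = 0; i < n; i++)
        for (int d = 0; d < 3; d++)
            s[d] += f[i][d];
    return fabs(s[0]) < tol && fabs(s[1]) < tol && fabs(s[2]) < tol;
}

int main(void)
{
    int n = 1024;                       /* problem scale: grows with the machine */
    particle *p = malloc((size_t)n * sizeof *p);
    double (*f)[3] = malloc((size_t)n * sizeof *f);
    generate_problem(p, n, 42u);
    direct_forces(p, f, n);
    printf("verification %s\n", verify(f, n, 1e-6 * n) ? "PASSED" : "FAILED");
    free(p);
    free(f);
    return 0;
}

Note that the verification relies on a mathematical property of the problem (the per-component force sums cancel to roundoff) rather than on bitwise comparison against the reference output, so a co-designed or reordered implementation can be checked without prescribing its evaluation order.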

References

[1] Anoop Gupta et al. SPLASH: Stanford parallel applications for shared-memory. ACM SIGARCH Computer Architecture News, 1992.
[2] David B. Yoffie et al. Intel Corporation 2005. 2005.
[3] Kunle Olukotun et al. STAMP: Stanford Transactional Applications for Multi-Processing. 2008 IEEE International Symposium on Workload Characterization (IISWC), 2008.
[4] J. Demmel et al. A testing infrastructure for LAPACK's symmetric eigensolvers. 2007.
[5] Kai Li et al. The PARSEC benchmark suite: Characterization and architectural implications. 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
[6] Samuel Williams et al. A design methodology for domain-optimized power-efficient supercomputing. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC), 2009.
[7] Samuel Williams et al. Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
[8] Samuel Williams et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07), 2007.
[9] IEEE Transactions on Acoustics, Speech, and Signal Processing, 1983.
[10] David A. Bader et al. BioPerf: A benchmark suite to evaluate high-performance computer architecture on bioinformatics applications. 2005 IEEE International Symposium on Workload Characterization, 2005.
[11] Samuel Williams et al. Lattice Boltzmann simulation optimization on leading multicore platforms. 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008.
[12] David A. Bader. Designing Scalable Synthetic Compact Applications for benchmarking high productivity computing systems. 2006.
[13] Edward A. Lee et al. The Parallel Computing Laboratory at U.C. Berkeley: A research agenda based on the Berkeley View. 2008.
[14] Samuel Williams et al. An auto-tuning framework for parallel multicore stencil computations. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
[15] Jack J. Dongarra et al. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 2001.
[16] Berkin Özisikyilmaz et al. MineBench: A benchmark suite for data mining workloads. 2006 IEEE International Symposium on Workload Characterization, 2006.
[17] Glenn Reinman et al. ParallAX: An architecture for real-time physics. ISCA '07, 2007.
[18] Miodrag Potkonjak et al. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. Proceedings of the 30th Annual International Symposium on Microarchitecture, 1997.
[19] Samuel Williams et al. Memory-efficient optimization of gyrokinetic particle-to-grid interpolation for multicore processors. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC), 2009.
[20] Yen-Kuang Chen et al. The ALPBench benchmark suite for complex multimedia applications. 2005 IEEE International Symposium on Workload Characterization, 2005.
[21] Kevin Skadron et al. Rodinia: A benchmark suite for heterogeneous computing. 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009.
[22] Keshav Pingali et al. Lonestar: A suite of parallel irregular programs. 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[23] Katherine Yelick et al. OSKI: A library of automatically tuned sparse matrix kernels. 2005.
[24] Yuefan Deng et al. New trends in high performance computing. Parallel Computing, 2001.
[25] Samuel Williams et al. The Landscape of Parallel Computing Research: A View from Berkeley. 2006.
[26] Steven G. Johnson et al. FFTW: An adaptive software architecture for the FFT. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 1998.
[27] Samuel Williams et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. SC '08: International Conference for High Performance Computing, Networking, Storage and Analysis, 2008.
[28] David H. Bailey et al. The NAS Parallel Benchmarks. International Journal of High Performance Computing Applications, 1991.