TORCH Computational Reference Kernels - A Testbed for Computer Science Research

For decades, computer scientists have sought guidance on how to evolve architectures, languages, and programming models in order to improve application performance, efficiency, and productivity. Unfortunately, without overarching advice about future directions in these areas, individual guidance is inferred from the existing software/hardware ecosystem, and each discipline often conducts its research independently, assuming all other technologies remain fixed. In today's rapidly evolving world of on-chip parallelism, isolated and iterative improvements to performance may miss superior solutions in the same way that gradient descent optimization techniques may get stuck in local minima. To combat this, we present TORCH: A Testbed for Optimization ResearCH. These computational reference kernels define the core problems of interest in scientific computing without mandating a specific language, algorithm, programming model, or implementation. To complement the kernel (problem) definitions, we provide a set of algorithmically expressed verification tests that can be used to verify that a hardware/software co-designed solution produces an acceptable answer. Finally, to illustrate how researchers have implemented solutions to these problems in the past, we provide a set of reference implementations in C and MATLAB.
