PetaBricks: a language and compiler for algorithmic choice

It is often impossible to obtain a one-size-fits-all solution for high-performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best choice is often tightly coupled to the architecture, problem size, data, and available system resources. In some cases, completely different algorithms may provide the best performance. Current compiler and programming language techniques can change some of these parameters, but today there is no simple way for the programmer to express, or the compiler to choose, different algorithms to handle different parts of the data. Existing solutions normally handle only coarse-grained, library-level selections or hand-coded cutoffs between base and recursive cases. We present PetaBricks, a new implicitly parallel language and compiler where having multiple implementations of multiple algorithms to solve a problem is the natural way of programming. We make algorithmic choice a first-class construct of the language. Choices are provided in a way that also allows our compiler to tune at a finer granularity. The PetaBricks compiler autotunes programs by making both fine-grained and algorithmic choices. Choices also include different automatic parallelization techniques, data distributions, algorithmic parameters, transformations, and blocking. Additionally, we introduce novel techniques to autotune algorithms for different convergence criteria. When choosing between various direct and iterative methods, the PetaBricks compiler is able to tune a program so that it delivers near-optimal efficiency for any desired level of accuracy. The compiler can use different convergence criteria for the various components within a single algorithm, providing the user with accuracy choice alongside algorithmic choice.
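To make the notion of algorithmic choice concrete, the C++ sketch below hand-codes what the abstract says PetaBricks expresses declaratively and tunes automatically: two interchangeable sort implementations and a cutoff between the base and recursive cases. This is an illustration only, not PetaBricks syntax; the names (`SORT_CUTOFF`, `insertion_sort`, `merge_sort`) are hypothetical, and in PetaBricks the alternatives would be listed as choices within a transform and the autotuner, not the programmer, would pick among them and set the cutoff.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tunable parameter: in an autotuned system this cutoff (and the choice of
// algorithm on either side of it) would be selected empirically per machine
// and input size. The value here is only a placeholder.
static const std::size_t SORT_CUTOFF = 64;

// Choice 1: insertion sort, typically fastest on small inputs.
static void insertion_sort(std::vector<double>& a, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i < hi; ++i) {
        double key = a[i];
        std::size_t j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
        a[j] = key;
    }
}

// Choice 2: recursive merge sort, asymptotically better on large inputs.
static void merge_sort(std::vector<double>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo <= SORT_CUTOFF) {   // hand-coded cutoff to the base case
        insertion_sort(a, lo, hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    merge_sort(a, lo, mid);
    merge_sort(a, mid, hi);
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}

// Entry point: sorts the whole vector using the composed choices above.
void sort_choice(std::vector<double>& a) {
    merge_sort(a, 0, a.size());
}
```

In the hand-coded version above, the programmer must commit to one composition and one cutoff; the point of the paper is that these decisions are instead left open in the source and resolved by the compiler's autotuner for each target architecture and accuracy requirement.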
