Expression Tree Evaluation by Dynamic Code Generation - Are Accelerators Up for the Task?

Dynamic code generation is worthwhile when the benefit of code specialized to values known only at runtime outweighs the time spent generating it. Such techniques are increasingly used in HPC applications to tune runtime behavior. The simulation software investigated in this paper is a typical example: it spends a significant portion of its computing time evaluating symbolic formulas that are set up dynamically from model data. However, any software tuning has to match the hardware. Because of the so-called power wall, HPC systems are increasingly equipped with throughput-oriented accelerator components to sustain the performance growth known from Top500 history. To exploit such systems well, it is important to understand how well applications map to heterogeneous components. Dynamic code generation can work well on standard multi-core systems; in this paper, we investigate whether accelerators benefit this scenario. For our application, we show that although the generated code runs well on the accelerator, the generation itself has serious issues and maps much better to standard multi-cores. We therefore conclude that upcoming HPC systems will still need a significant share of latency-oriented, and thus complex, general-purpose hardware.
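
To make the trade-off concrete, the following is a minimal sketch of the general idea, not the generator used in the paper. It contrasts interpreting an expression tree on every evaluation with generating a specialized function once and reusing it. All names (`Const`, `Var`, `BinOp`, `interpret`, `generate`) are hypothetical, and Python's built-in `compile` stands in here for a native code generator.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "-", "*", "/"
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Var, BinOp]
ALLOWED_OPS = {"+", "-", "*", "/"}


def interpret(e, env):
    """Baseline: walk the tree on every evaluation."""
    if isinstance(e, Const):
        return e.value
    if isinstance(e, Var):
        return env[e.name]
    left = interpret(e.left, env)
    right = interpret(e.right, env)
    if e.op == "+":
        return left + right
    if e.op == "-":
        return left - right
    if e.op == "*":
        return left * right
    if e.op == "/":
        return left / right
    raise ValueError(f"unsupported operator: {e.op!r}")


def to_source(e):
    """Translate the tree into a fully parenthesized source expression."""
    if isinstance(e, Const):
        return repr(float(e.value))
    if isinstance(e, Var):
        if not e.name.isidentifier():
            raise ValueError(f"invalid variable name: {e.name!r}")
        return e.name
    if e.op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {e.op!r}")
    return f"({to_source(e.left)} {e.op} {to_source(e.right)})"


def generate(e, params):
    """Pay the generation cost once; return a function specialized to the tree."""
    for p in params:
        if not p.isidentifier():
            raise ValueError(f"invalid parameter name: {p!r}")
    src = f"lambda {', '.join(params)}: {to_source(e)}"
    return eval(compile(src, "<generated>", "eval"))


# Formula assumed for illustration only: x * 2.0 + y
tree = BinOp("+", BinOp("*", Var("x"), Const(2.0)), Var("y"))
specialized = generate(tree, ["x", "y"])
```

Generation pays off only if `specialized` is called often enough to amortize the one-time cost of `generate`, which is the break-even condition stated at the start of the abstract.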
