X10 for High-Performance Scientific Computing

High-performance computing is a key technology that enables large-scale physical simulation in modern science. While great advances have been made in methods and algorithms for scientific computing, the most commonly used programming models encourage a fragmented view of computation that maps poorly to the underlying computer architecture. Scientific applications typically manifest physical locality, which means that interactions between entities or events that are nearby in space or time are stronger than more distant interactions. Linear-scaling methods exploit physical locality by approximating distant interactions, reducing computational complexity so that cost is proportional to system size. In these methods, the computation required for each portion of the system differs according to that portion’s contribution to the overall result. To support productive development, application programmers need programming models that cleanly map aspects of the physical system being simulated to the underlying computer architecture while also supporting the irregular workloads that arise from the fragmentation of a physical system.

X10 is a new programming language for high-performance computing that uses the asynchronous partitioned global address space (APGAS) model, which combines explicit representation of locality with asynchronous task parallelism. This thesis argues that the X10 language is well suited to expressing the algorithmic properties of locality and irregular parallelism that are common to many methods for physical simulation.

The work reported in this thesis was part of a co-design effort involving researchers at IBM and ANU in which two significant computational chemistry codes were developed in X10, with the aim of improving the expressiveness and performance of the language. The first is a Hartree–Fock electronic structure code, implemented using the novel Resolution of the Coulomb Operator approach. The second evaluates electrostatic interactions between point charges, using either the smooth particle mesh Ewald method or the fast multipole method, with the latter used to simulate ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer. We compare the performance of both X10 applications to state-of-the-art software packages written in other languages.

This thesis presents improvements to the X10 language and runtime libraries for managing and visualizing the data locality of parallel tasks, communication using active messages, and efficient implementation of distributed arrays. We evaluate these
