Languages and Compilers for Parallel Computing: 31st International Workshop, LCPC 2018, Salt Lake City, UT, USA, October 9–11, 2018, Revised Selected Papers

Although compiler technologies for automatic vectorization have been under development for over four decades, there are still considerable gaps in the capabilities of modern compilers to perform automatic vectorization for SIMD units. One such gap can be found in the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and output (write-after-write) dependences. Past approaches, such as variable renaming and variable expansion, break such dependence cycles by either eliminating or repositioning the problematic memory-based dependences. However, past work suffers from three key limitations: (1) lack of a unified framework that synergistically integrates multiple storage transformations, (2) lack of support for bounding the additional space required to break memory-based dependences, and (3) lack of support for integrating these storage transformations with other code transformations (e.g., statement reordering) to enable vectorization. In this paper, we address these three limitations by integrating both Source Variable Renaming (SoVR) and Sink Variable Renaming (SiVR) transformations into a unified formulation, and by formalizing the “cycle-breaking” problem as a minimum weighted set cover optimization problem. To the best of our knowledge, our work is the first to formalize an optimal solution for cycle breaking that simultaneously considers both SoVR and SiVR transformations, thereby enhancing vectorization and reducing storage expansion relative to performing the transformations independently. We implemented our approach in PPCG, a state-of-the-art optimization framework for loop transformations, and evaluated it on eleven kernels from the TSVC benchmark suite. Our experimental results show a geometric mean performance improvement of 4.61× on an Intel Xeon Phi (KNL) machine relative to the optimized performance obtained by Intel’s ICC v17.0 product compiler.
Further, our results demonstrate geometric mean performance improvements of 1.08× and 1.14× on the Intel Xeon Phi (KNL) and Nvidia Tesla V100 (Volta) platforms relative to past work that performs only the SiVR transformation [5], and of 1.57× and 1.22× on the same two platforms relative to past work that uses both SiVR and SoVR transformations [8].

© Springer Nature Switzerland AG 2019. P. Chatarasi et al., in: M. Hall and H. Sundar (Eds.), LCPC 2018, LNCS 11882, pp. 1–20, 2019. https://doi.org/10.1007/978-3-030-34627-0_1
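To make the kind of dependence discussed above concrete, the sketch below shows a loop with a loop-carried anti (write-after-read) dependence and a hand-applied renaming in the spirit of the variable-renaming transformations the abstract mentions: the read operand is copied into a temporary array up front, so later writes no longer conflict with earlier reads and both statements become vectorizable. This is an illustrative sketch only, not the paper's SoVR/SiVR algorithm; the function name, data, and fixed buffer size are assumptions made for the example.

```c
#include <string.h>

/* S1 reads a[i+1], which S2 overwrites in the next iteration: a
 * loop-carried anti (write-after-read) dependence that blocks
 * vectorizing the two statements together.  Copying the read
 * operand into a_copy first (bounded extra storage) removes the
 * conflict, so both statements can be vectorized. */
void renamed_loop(float *a, float *b, const float *c, int n) {
    float a_copy[1024];                   /* assumes n <= 1024 */
    memcpy(a_copy, a, sizeof(float) * (size_t)n);
    for (int i = 0; i < n - 1; i++) {
        b[i] = a_copy[i + 1] + 1.0f;      /* S1: now reads the copy */
        a[i] = c[i] * 2.0f;               /* S2: write no longer conflicts */
    }
}
```

The transformed loop produces the same values as a serial execution of the original, at the cost of one extra array — the kind of storage expansion the paper's formulation tries to bound.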
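The abstract formulates cycle breaking as a minimum weighted set cover: each candidate transformation "covers" the dependence cycles it breaks, at a weight reflecting its extra storage cost. The paper computes an optimal cover; as a rough sketch of the underlying problem only, the following standard greedy approximation (with made-up cover/weight data in the test) repeatedly picks the transformation with the best cycles-broken-per-weight ratio until every cycle is covered.

```c
/* Greedy weighted set cover over dependence cycles.  covers is an
 * nsets x ncycles row-major 0/1 matrix: covers[s*ncycles + c] == 1
 * iff transformation s breaks cycle c.  Returns the total weight of
 * the chosen transformations, or -1 if some cycle cannot be covered.
 * Assumes nsets, ncycles <= 64 for the fixed-size scratch arrays. */
int greedy_cover(int nsets, int ncycles, const int *covers, const int *weight) {
    int covered[64] = {0}, chosen[64] = {0};
    int remaining = ncycles, total = 0;
    while (remaining > 0) {
        int best = -1;
        double best_ratio = 0.0;
        for (int s = 0; s < nsets; s++) {
            if (chosen[s]) continue;
            int gain = 0;                /* newly covered cycles */
            for (int c = 0; c < ncycles; c++)
                if (covers[s * ncycles + c] && !covered[c]) gain++;
            double ratio = (gain > 0) ? (double)gain / weight[s] : 0.0;
            if (ratio > best_ratio) { best = s; best_ratio = ratio; }
        }
        if (best < 0) return -1;         /* remaining cycles uncoverable */
        chosen[best] = 1;
        total += weight[best];
        for (int c = 0; c < ncycles; c++)
            if (covers[best * ncycles + c] && !covered[c]) {
                covered[c] = 1;
                remaining--;
            }
    }
    return total;
}
```

The greedy heuristic only approximates the optimum (within a ln(n) factor); the point here is the problem shape — transformations as weighted sets, cycles as elements — not the paper's exact solver.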

[1]  Toshio Endo,et al.  Scalable RMA-based Communication Library Featuring Node-local NVMs , 2018, 2018 IEEE High Performance extreme Computing Conference (HPEC).

[2]  Katherine A. Yelick, et al.  Tuning collective communication for Partitioned Global Address Space programming models , 2011, Parallel Comput.

[3]  John R. Gilbert,et al.  An interactive system for combinatorial scientific computing with an emphasis on programmer productivity , 2007 .

[4]  Eduard Ayguadé,et al.  Overlapping communication and computation by using a hybrid MPI/SMPSs approach , 2010, ICS '10.

[5]  Hongbo Rong,et al.  Automating Wavefront Parallelization for Sparse Matrix Computations , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[6]  Phillip Colella,et al.  Parallel Languages and Compilers: Perspective From the Titanium Experience , 2007, Int. J. High Perform. Comput. Appl..

[7]  Scott B. Baden,et al.  Bamboo -- Translating MPI applications to a latency-tolerant, data-driven form , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[8]  Vivek Sarkar,et al.  Automatic Verification of Determinism for Structured Parallel Programs , 2010, SAS.

[9]  R. M. Tomasulo,et al.  An efficient algorithm for exploiting multiple arithmetic units , 1995 .

[10]  Jeffery A Kuehn,et al.  OpenSHMEM Performance and Potential: A NPB Experimental Study , 2012 .

[11]  Bryan Carpenter, et al.  ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-Time Systems , 1999, IPPS/SPDP Workshops.

[12]  Ruediger Willenberg,et al.  A Heterogeneous GASNet Implementation for FPGA-accelerated Computing , 2014, PGAS.

[13]  Shoaib Kamil,et al.  Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[14]  Silvius Vasile Rus Hybrid analysis of memory references and its application to automatic parallelization , 2006 .

[15]  William Pugh,et al.  Nonlinear array dependence analysis , 1994 .

[16]  George Bosilca,et al.  UCX: An Open Source Framework for HPC Network APIs and Beyond , 2015, 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects.

[17]  Tao Yang,et al.  Optimizing threaded MPI execution on SMP clusters , 2001, ICS '01.

[18]  Amith R. Mamidala,et al.  PAMI: A Parallel Active Message Interface for the Blue Gene/Q Supercomputer , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[19]  William Pugh,et al.  Constraint-based array dependence analysis , 1998, TOPL.

[20]  Benoît Meister,et al.  The Open Community Runtime: A runtime system for extreme scale computing , 2016, 2016 IEEE High Performance Extreme Computing Conference (HPEC).

[21]  David A. Padua,et al.  Compiler analysis of irregular memory accesses , 2000, PLDI '00.

[22]  Leonardo Mendonça de Moura, et al.  Complete Instantiation for Quantified Formulas in Satisfiability Modulo Theories , 2009, CAV.

[23]  H. Su,et al.  SCI networking for shared-memory computing in UPC: blueprints of the GASNet SCI conduit , 2004, 29th Annual IEEE International Conference on Local Computer Networks.

[24]  John Wawrzynek,et al.  RAMP Blue: A Message-Passing Manycore System in FPGAs , 2007, 2007 International Conference on Field Programmable Logic and Applications.

[25]  Lawrence Rauchwerger,et al.  Hybrid Analysis: Static & Dynamic Memory Reference Analysis , 2004, International Journal of Parallel Programming.

[26]  Robert W. Numrich, et al.  Co-array Fortran for parallel programming , 1998, ACM SIGPLAN Fortran Forum.

[27]  Wu-chun Feng, et al.  The Quadrics network (QsNet): high-performance clustering technology , 2001, Hot Interconnects 9: Symposium on High Performance Interconnects.

[28]  Henny B. Sipma,et al.  What's Decidable About Arrays? , 2006, VMCAI.

[29]  Philip Heidelberger,et al.  The deep computing messaging framework: generalized scalable message passing on the blue gene/P supercomputer , 2008, ICS '08.

[30]  Dhabaleswar K. Panda,et al.  Optimizing Collective Communication in UPC , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.

[31]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[32]  Jack J. Dongarra,et al.  Collecting Performance Data with PAPI-C , 2009, Parallel Tools Workshop.

[33]  Katherine A. Yelick,et al.  On the conditions for efficient interoperability with threads: an experience with PGAS languages using cray communication domains , 2014, ICS '14.

[34]  Scott B. Baden,et al.  Toucan — A Translator for Communication Tolerant MPI Applications , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[35]  Larry Carter,et al.  An approach for code generation in the Sparse Polyhedral Framework , 2016, Parallel Comput..

[36]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[37]  Katherine Yelick,et al.  Automatic Performance Tuning and Analysis of Sparse Triangular Solve , 2002 .

[38]  Yunheung Paek,et al.  Simplification of array access patterns for compiler optimizations , 1998, PLDI.

[39]  Lawrence Rauchwerger,et al.  Logical inference techniques for loop parallelization , 2012, PLDI.

[40]  Katherine A. Yelick,et al.  UPC++: A PGAS Extension for C++ , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[41]  Robert E. Shostak,et al.  A Practical Decision Procedure for Arithmetic with Function Symbols , 1979, JACM.

[42]  Paul Feautrier, et al.  Fuzzy Array Dataflow Analysis , 1997, J. Parallel Distributed Comput.

[43]  Lawrence Rauchwerger,et al.  A Hybrid Approach to Proving Memory Reference Monotonicity , 2011, LCPC.

[44]  Tomás Vojnar,et al.  What Else Is Decidable about Integer Arrays? , 2008, FoSSaCS.

[45]  Jesús Labarta,et al.  A dependency-aware task-based programming environment for multi-core architectures , 2008, 2008 IEEE International Conference on Cluster Computing.

[46]  Mitsuhisa Sato,et al.  Preliminary Performance Evaluation of Coarray-based Implementation of Fiber Miniapp Suite using XcalableMP PGAS Language , 2017, PAW@SC.

[47]  Katherine A. Yelick,et al.  Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[48]  Tomofumi Yuki,et al.  Sparse Matrix Code Dependence Analysis Simplification at Compile Time , 2018, ArXiv.