Locality Optimizations for Regular and Irregular Applications

The fastest supercomputer in the world as of July 2016 is the Sunway TaihuLight, which achieves a staggering 93 PetaFlops. This performance comes from massive parallelism: today's supercomputers and compute clusters contain tens of thousands of distributed-memory nodes, each comprising several shared-memory multi-core or many-core processors. Scaling on these massively parallel systems is not an easy task. A major performance and scalability bottleneck is the limited data-movement bandwidth, which can be orders of magnitude smaller than the computation bandwidth. Developing applications that scale on such systems therefore requires minimizing the volume of data movement at each level of the memory hierarchy through locality optimization techniques. Locality optimization reduces data movement between slow and fast memory by rescheduling or remapping the original computation so that data is reused once it resides in fast memory, avoiding subsequent transfers of the same data from slow memory. This dissertation explores multiple aspects of locality optimization for enhancing the scalability and performance of various regular and irregular applications in massively parallel computing environments. It develops distributed algorithms, lower-bound techniques, and compiler and runtime frameworks for optimizing Tensor Contractions, the Four-Index Transform, Convolutional Neural Networks (CNNs), and Recursive Tree Traversals.
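The rescheduling idea above can be illustrated with loop tiling (blocking), a classic locality transformation. The sketch below is purely illustrative and not taken from the dissertation: it restructures a naive matrix product so that each small tile of the operands is reused while it is resident in fast memory, rather than being re-fetched from slow memory on every outer iteration. The tile size `T` is a hypothetical parameter; in practice it would be chosen so that three T x T tiles fit in the targeted level of the memory hierarchy.

```python
def matmul_naive(A, B, n):
    # Baseline triple loop: each element of B is streamed from slow
    # memory n times, giving poor reuse for large n.
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, T=4):
    # Tiled schedule: the same iteration space, reordered so that the
    # three innermost loops touch only three T x T tiles at a time.
    # Each element brought into fast memory is reused ~T times before
    # being evicted, cutting slow-memory traffic by a factor of ~T.
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        for k in range(kk, min(kk + T, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Both schedules perform the same arithmetic and produce identical results; only the order of the computation (and hence the data-movement volume) changes, which is exactly the kind of semantics-preserving remapping that locality optimization performs.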
