Locality Optimizations for Regular and Irregular Applications
[1] P. Sadayappan,et al. High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.
[2] Quoc V. Le,et al. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.
[3] Alain Darte. On the Complexity of Loop Fusion , 2000, Parallel Comput..
[4] Martin C. Rinard,et al. Commutativity analysis: a new analysis technique for parallelizing compilers , 1997, TOPL.
[5] J. Ramanujam,et al. Global communication optimization for tensor contraction expressions under memory constraints , 2003, Proceedings International Parallel and Distributed Processing Symposium.
[6] Dror Irony,et al. Communication lower bounds for distributed-memory matrix multiplication , 2004, J. Parallel Distributed Comput..
[7] Simon Haykin,et al. Gradient-Based Learning Applied to Document Recognition , 2001 .
[8] V. Sarkar,et al. Collective Loop Fusion for Array Contraction , 1992, LCPC.
[9] Jason Cong,et al. Minimizing Computation in Convolutional Neural Networks , 2014, ICANN.
[10] Fang Wang,et al. A sparse matrix approach to neural network training , 1995, Proceedings of ICNN'95 - International Conference on Neural Networks.
[11] Roman Leshchinskiy,et al. Stream fusion: from lists to streams to nothing at all , 2007, ICFP '07.
[12] R. Bartlett,et al. Coupled-cluster theory in quantum chemistry , 2007 .
[13] Robert J. Harrison,et al. Shared Memory Programming in Metacomputing Environments: The Global Array Approach , 1997, The Journal of Supercomputing.
[14] Milind Kulkarni,et al. Tree dependence analysis , 2015, PLDI.
[15] James Demmel,et al. Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[16] Matthew L. Leininger,et al. Psi4: an open‐source ab initio electronic structure program , 2012 .
[17] Hideo Sekino,et al. Basis set limit Hartree-Fock and density functional theory response property evaluation by multiresolution multiwavelet basis. , 2008, The Journal of chemical physics.
[18] Uday Bondhugula,et al. Loop transformations: convexity, pruning and optimization , 2011, POPL '11.
[19] Sriram Krishnamoorthy,et al. A framework for load balancing of Tensor Contraction expressions via dynamic task partitioning , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[20] Sriram Krishnamoorthy,et al. Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[21] J. Ramanujam,et al. Performance modeling and optimization of parallel out-of-core tensor contractions , 2005, PPoPP.
[22] Brian Vinter,et al. A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors , 2015, J. Parallel Distributed Comput..
[23] Alexandru Nicolau,et al. A general data dependence test for dynamic, pointer-based data structures , 1994, PLDI '94.
[24] Jürgen Schmidhuber,et al. Multi-column deep neural networks for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[25] Patrice Y. Simard,et al. Best practices for convolutional neural networks applied to visual document analysis , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..
[26] Tobias Gysi,et al. STELLA: a domain-specific tool for structured grid methods in weather and climate models , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[27] Yves Robert,et al. Matrix product on heterogeneous master-worker platforms , 2008, PPoPP.
[28] Simon L. Peyton Jones,et al. Compiling Haskell by Program Transformation: A Report from the Trenches , 1996, ESOP.
[29] M. V. Stoitsov,et al. Deformed coordinate-space Hartree-Fock-Bogoliubov approach to weakly bound nuclei and large deformations , 2008, 0807.3036.
[30] Jarek Nieplocha,et al. Global Arrays User Manual , 2007 .
[31] Christopher Ré,et al. Caffe con Troll: Shallow Ideas to Speed Up Deep Learning , 2015, DanaC@SIGMOD.
[32] Vivek Sarkar,et al. A Transformation Framework for Optimizing Task-Parallel Programs , 2013, TOPL.
[33] Marc'Aurelio Ranzato,et al. Large Scale Distributed Deep Networks , 2012, NIPS.
[34] Trishul M. Chilimbi,et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System , 2014, OSDI.
[35] Keshav Pingali,et al. The tao of parallelism in algorithms , 2011, PLDI '11.
[36] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..
[37] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[38] Misha Denil,et al. Predicting Parameters in Deep Learning , 2014 .
[39] Robert J. Harrison,et al. Fast multiresolution methods for density functional theory in nuclear physics , 2009 .
[40] Honglak Lee,et al. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.
[41] Simon L. Peyton Jones,et al. A short cut to deforestation , 1993, FPCA '93.
[42] Henry F. Schaefer,et al. Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor , 1987 .
[43] Laurie J. Hendren,et al. Detecting Parallelism in C Programs with Recursive Data Structures , 1998, CC.
[44] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[45] Robert J. Harrison,et al. Parallel direct four-index transformations , 1996 .
[46] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[47] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[48] John F. Stanton,et al. The ACES II program system , 1992 .
[49] Philip Wadler,et al. Deforestation: Transforming Programs to Eliminate Trees , 1988, Theoretical Computer Science.
[50] J. Ramanujam,et al. On Characterizing the Data Access Complexity of Programs , 2014, POPL.
[51] Yann LeCun,et al. Fast Training of Convolutional Networks through FFTs , 2013, ICLR.
[52] Ming Yang,et al. 3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[53] Leo A. Meyerovich,et al. Parallel schedule synthesis for attribute grammars , 2013, PPoPP '13.
[54] Tjerk P. Straatsma,et al. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations , 2010, Comput. Phys. Commun..
[55] Sriram Krishnamoorthy,et al. A Communication-Optimal Framework for Contracting Distributed Tensors , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[56] Ramesh C. Agarwal,et al. A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..
[57] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.
[58] S. Hirata. Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories , 2003 .
[59] Nancy M. Amato,et al. STAPL: An Adaptive, Generic Parallel C++ Library , 2001, LCPC.
[60] Uday Bondhugula,et al. Effective automatic parallelization of stencil computations , 2007, PLDI '07.
[61] Chau-Wen Tseng,et al. Improving data locality with loop transformations , 1996, TOPL.
[62] Clément Farabet,et al. Torch7: A Matlab-like Environment for Machine Learning , 2011, NIPS 2011.
[63] Patrice Y. Simard,et al. High Performance Convolutional Neural Networks for Document Processing , 2006 .
[64] Gregory Beylkin,et al. Multiresolution quantum chemistry in multiwavelet bases: Hartree-Fock exchange. , 2004, The Journal of chemical physics.
[65] J. Ramanujam,et al. Loop optimization for a class of memory-constrained computations , 2001, ICS '01.
[66] Andrew Zisserman,et al. Speeding up Convolutional Neural Networks with Low Rank Expansions , 2014, BMVC.
[67] John F. Stanton,et al. A massively parallel tensor contraction framework for coupled-cluster computations , 2014, J. Parallel Distributed Comput..
[68] Jason Weston,et al. Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..
[69] Robert G. Harrison,et al. Attosecond electron dynamics: A multiresolution approach , 2012 .
[70] Thomas R. Furlani,et al. Implementation of a parallel direct SCF algorithm on distributed memory computers , 1995, J. Comput. Chem..
[71] James Demmel,et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.
[72] John Darlington,et al. A Transformation System for Developing Recursive Programs , 1977, J. ACM.
[73] Jarek Nieplocha,et al. Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit , 2006, Int. J. High Perform. Comput. Appl..
[74] Liu Peng,et al. High-order stencil computations on multicore clusters , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[75] Beverly A. Sanders,et al. Software design of ACES III with the super instruction architecture , 2011 .
[76] Joan Bruna,et al. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation , 2014, NIPS.
[77] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..
[78] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.
[79] H. T. Kung,et al. I/O complexity: The red-blue pebble game , 1981, STOC '81.
[80] Martin C. Rinard,et al. Symbolic bounds analysis of pointers, array indices, and accessed memory regions , 2005, TOPL.
[81] Razvan Pascanu,et al. Theano: new features and speed improvements , 2012, ArXiv.
[82] Tao Wang,et al. Deep learning with COTS HPC systems , 2013, ICML.
[83] Martin D. Schatz. Anatomy of Parallel Computation with Tensors, FLAME Working Note #72 , 2013 .
[84] R. C. Whaley,et al. ATLAS (Automatically Tuned Linear Algebra Software) , 2011, Encyclopedia of Parallel Computing.
[85] Fang Wang,et al. An adaptive and fully sparse training approach for multilayer perceptrons , 1996, Proceedings of International Conference on Neural Networks (ICNN'96).
[86] T. Crawford,et al. An Introduction to Coupled Cluster Theory for Computational Chemists , 2007 .
[87] William Scott Thornton,et al. Electronic Excitations in YTiO3 using TDDFT and electronic structure using a multiresolution framework , 2011 .
[88] Robert J. Harrison,et al. On fusing recursive traversals of K-d trees , 2016, CC.
[89] Mark S. Gordon,et al. General atomic and molecular electronic structure system , 1993, J. Comput. Chem..
[90] Mark S. Gordon,et al. Parallel algorithm for integral transformations and GUGA MCSCF , 1994 .
[91] David E. Bernholdt,et al. Space-time trade-off optimization for a class of electronic structure calculations , 2002, PLDI '02.
[92] Milind Kulkarni,et al. Enhancing locality for recursive traversals of recursive structures , 2011, OOPSLA '11.
[93] John Tran,et al. cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.
[94] James Demmel,et al. Improving communication performance in dense linear algebra via topology aware collectives , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[95] Shridhar R. Gadre,et al. A general parallel solution to the integral transformation and second‐order Møller–Plesset energy evaluation on distributed memory parallel machines , 1994 .
[96] R. J. Harrison,et al. Coordinate-Space Hartree-Fock-Bogoliubov Solvers for Superfluid Fermi Systems in Large Boxes , 2012 .
[97] Sriram Krishnamoorthy,et al. Scalable implementations of accurate excited-state coupled cluster theories: Application of high-level methods to porphyrin-based systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[98] Gregory Beylkin,et al. Multiresolution quantum chemistry: basic theory and initial applications. , 2004, The Journal of chemical physics.
[99] James Demmel,et al. Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..
[100] Mark S. Gordon,et al. DEVELOPMENTS IN PARALLEL ELECTRONIC STRUCTURE THEORY , 2007 .
[101] Keshav Pingali,et al. Optimistic parallelism requires abstractions , 2009, CACM.
[102] Bradley K. Alpert,et al. Adaptive solution of partial differential equations in multiwavelet bases , 2002 .
[103] Josef Svenningsson. Shortcut fusion for accumulating parameters & zip-like functions , 2002, ICFP '02.
[104] James R. Larus,et al. Detecting conflicts between structure accesses , 1988, PLDI '88.
[105] Bradley C. Kuszmaul,et al. Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.
[106] Robert J. Harrison,et al. Solving PDEs in irregular geometries with multiresolution methods I: Embedded Dirichlet boundary conditions , 2012, Comput. Phys. Commun..
[107] Hyuk-Jae Lee,et al. Generalized Cannon's algorithm for parallel matrix multiplication , 1997, ICS '97.
[108] Berin Martini,et al. NeuFlow: A runtime reconfigurable dataflow processor for vision , 2011, CVPR 2011 WORKSHOPS.
[109] Olatunji Ruwase,et al. Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems , 2015, KDD.
[110] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.
[111] Hassan Foroosh,et al. Sparse Convolutional Neural Networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[112] Franz Franchetti,et al. Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures , 2011, CC.
[113] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.
[114] David E. Bernholdt,et al. Automated Operation Minimization of Tensor Contraction Expressions in Electronic Structure Calculations , 2005, International Conference on Computational Science.
[115] Jason Weston,et al. A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.
[116] Uday Bondhugula,et al. A model for fusion and code motion in an automatic parallelizing compiler , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[117] David E. Bernholdt,et al. Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models , 2005, Proceedings of the IEEE.
[118] James Demmel,et al. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[119] Robert J. Harrison,et al. Multiresolution Quantum Chemistry in Multiwavelet Bases , 2003, International Conference on Computational Science.
[120] Martin C. Rinard,et al. Pointer analysis for structured parallel programs , 2003, TOPL.