A Communication-Optimal Framework for Contracting Distributed Tensors
Sriram Krishnamoorthy | P. Sadayappan | Samyam Rajbhandari | Kevin Stock | Akshay Nikam | Pai-Wei Lai
[1] Sriram Krishnamoorthy, et al. Scalable implementations of accurate excited-state coupled cluster theories: Application of high-level methods to porphyrin-based systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[2] J. Ramanujam, et al. Loop optimization for a class of memory-constrained computations, 2001, ICS '01.
[3] Barbara M. Chapman, et al. Performance Analysis of the NWChem TCE for Different Communication Patterns, 2013, PMBS@SC.
[4] Sriram Krishnamoorthy, et al. A framework for load balancing of Tensor Contraction expressions via dynamic task partitioning, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[5] Sriram Krishnamoorthy, et al. Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[6] David E. Bernholdt, et al. Automated Operation Minimization of Tensor Contraction Expressions in Electronic Structure Calculations, 2005, International Conference on Computational Science.
[7] Jarek Nieplocha, et al. Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit, 2006, Int. J. High Perform. Comput. Appl.
[8] David E. Bernholdt, et al. Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models, 2005, Proceedings of the IEEE.
[9] Robert J. Harrison, et al. Shared Memory Programming in Metacomputing Environments: The Global Array Approach, 1997, The Journal of Supercomputing.
[10] Hyuk-Jae Lee, et al. Generalized Cannon's algorithm for parallel matrix multiplication, 1997, ICS '97.
[11] James Demmel, et al. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[12] J. Ramanujam, et al. Global communication optimization for tensor contraction expressions under memory constraints, 2003, Proceedings International Parallel and Distributed Processing Symposium.
[13] Robert A. van de Geijn, et al. SUMMA: scalable universal matrix multiplication algorithm, 1995, Concurr. Pract. Exp.
[14] James Demmel, et al. Minimizing Communication in Numerical Linear Algebra, 2009, SIAM J. Matrix Anal. Appl.
[15] R. Bartlett, et al. Coupled-cluster theory in quantum chemistry, 2007.
[16] J. Ramanujam, et al. Performance modeling and optimization of parallel out-of-core tensor contractions, 2005, PPoPP.
[17] James Demmel, et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms, 2011, Euro-Par.
[18] Mark S. Gordon, et al. Chapter 41 – Advances in electronic structure theory: GAMESS a decade later, 2005.
[19] Beverly A. Sanders, et al. Software design of ACES III with the super instruction architecture, 2011.
[20] Tjerk P. Straatsma, et al. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations, 2010, Comput. Phys. Commun.
[21] James Demmel, et al. Improving communication performance in dense linear algebra via topology aware collectives, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[22] David E. Bernholdt, et al. Space-time trade-off optimization for a class of electronic structure calculations, 2002, PLDI '02.
[23] T. Crawford, et al. An Introduction to Coupled Cluster Theory for Computational Chemists, 2007.
[24] Ramesh C. Agarwal, et al. A three-dimensional approach to parallel matrix multiplication, 1995, IBM J. Res. Dev.
[25] Robert A. van de Geijn, et al. SUMMA: Scalable Universal Matrix Multiplication Algorithm, 1995.
[26] Dror Irony, et al. Communication lower bounds for distributed-memory matrix multiplication, 2004, J. Parallel Distributed Comput.
[27] James Demmel, et al. Communication-Avoiding Parallel Strassen: Implementation and performance, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.