A framework for load balancing of Tensor Contraction expressions via dynamic task partitioning
Sriram Krishnamoorthy | P. Sadayappan | Samyam Rajbhandari | Kevin Stock | Pai-Wei Lai
[1] David E. Bernholdt, et al. Space-time trade-off optimization for a class of electronic structure calculations, 2002, PLDI '02.
[2] Mitsuhiko Toda, et al. Methods for Visual Understanding of Hierarchical System Structures, 1981, IEEE Transactions on Systems, Man, and Cybernetics.
[3] Tjerk P. Straatsma, et al. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations, 2010, Comput. Phys. Commun.
[4] Jarek Nieplocha, et al. Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit, 2006, Int. J. High Perform. Comput. Appl.
[5] Oliver Bastert, et al. Layered Drawings of Digraphs, 1999, Drawing Graphs.
[6] Dhabaleswar K. Panda, et al. High Performance Remote Memory Access Communication: The ARMCI Approach, 2006, Int. J. High Perform. Comput. Appl.
[7] David E. Bernholdt, et al. Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models, 2005, Proceedings of the IEEE.
[8] James Demmel, et al. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[9] Robert A. van de Geijn, et al. SUMMA: scalable universal matrix multiplication algorithm, 1995, Concurr. Pract. Exp.
[10] Sriram Krishnamoorthy, et al. Scalable implementations of accurate excited-state coupled cluster theories: Application of high-level methods to porphyrin-based systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[11] S. Hirata. Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories, 2003.
[12] Kurt Mehlhorn, et al. Graph Algorithm and NP-Completeness, 1984.
[13] David E. Bernholdt, et al. Identifying Cost-Effective Common Subexpressions to Reduce Operation Count in Tensor Contraction Evaluations, 2006, International Conference on Computational Science.
[14] T. Crawford, et al. An Introduction to Coupled Cluster Theory for Computational Chemists, 2007.
[15] Sriram Krishnamoorthy, et al. Scioto: A Framework for Global-View Task Parallelism, 2008, 2008 37th International Conference on Parallel Processing.
[16] Ronald L. Graham, et al. Optimal scheduling for two-processor systems, 1972, Acta Informatica.
[17] J. Ramanujam, et al. Loop optimization for a class of memory-constrained computations, 2001, ICS '01.
[18] Laxmikant V. Kalé, et al. Work stealing and persistence-based load balancers for iterative overdecomposed applications, 2012, HPDC '12.
[19] Sriram Krishnamoorthy, et al. Scalable work stealing, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[20] David E. Bernholdt, et al. Automated Operation Minimization of Tensor Contraction Expressions in Electronic Structure Calculations, 2005, International Conference on Computational Science.
[21] Kurt Mehlhorn, et al. Data Structures and Algorithms 2: Graph Algorithms and NP-Completeness, 1984, EATCS Monographs on Theoretical Computer Science.
[22] P. Sadayappan, et al. Effective Utilization of Tensor Symmetry in Operation Optimization of Tensor Contraction Expressions, 2012.
[23] Pavan Balaji, et al. Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions, 2013, 2013 42nd International Conference on Parallel Processing.
[24] David E. Bernholdt, et al. Automatic code generation for many-body electronic structure methods: the tensor contraction engine, 2006.