Topology-Aware MPI Communication and Scheduling for High Performance Computing Systems
[1] S. C. Johnson. Hierarchical clustering schemes , 1967, Psychometrika.
[2] Miss A.O. Penney. (b) , 1974, The New Yale Book of Quotations.
[3] Kenneth Steiglitz,et al. Combinatorial Optimization: Algorithms and Complexity , 1981 .
[4] Shahid H. Bokhari,et al. On the Mapping Problem , 1981, IEEE Transactions on Computers.
[5] Charles E. Leiserson,et al. Randomized Routing on Fat-Trees , 1989, Adv. Comput. Res..
[6] Jake K. Aggarwal,et al. A Mapping Strategy for Parallel Processing , 1987, IEEE Transactions on Computers.
[7] Francine Berman,et al. On Mapping Parallel Algorithms into Parallel Architectures , 1987, J. Parallel Distributed Comput..
[8] M. Nei,et al. The neighbor-joining method , 1987 .
[9] Scott F. Midkiff,et al. Processor and Link Assignment in Multicomputers Using Simulated Annealing , 1988, ICPP.
[10] J. Ramanujam,et al. Task allocation onto a hypercube by recursive mincut bipartitioning , 1990, C3P.
[11] Takanobu Baba,et al. A network-topology independent task allocation strategy for parallel computers , 1990, Proceedings SUPERCOMPUTING '90.
[12] Scott F. Midkiff,et al. Heuristic Technique for Processor and Link Assignment in Multicomputers , 1991, IEEE Trans. Computers.
[13] Geoffrey C. Fox,et al. Allocating data to multicomputer nodes by physical optimization algorithms for loosely synchronous computations , 1992, Concurr. Pract. Exp..
[14] Laxmikant V. Kalé,et al. CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.
[15] Geoffrey C. Fox,et al. Graph contraction for physical optimization methods: a quality-cost tradeoff for mapping data on parallel computers , 1993, ICS '93.
[16] Don Allen,et al. A scalable debugger for massively parallel message-passing programs , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.
[17] Charles L. Seitz,et al. Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.
[18] Mohan Kumar,et al. On generalized fat trees , 1995, Proceedings of 9th International Parallel Processing Symposium.
[19] S. Arunkumar,et al. Genetic algorithm based heuristics for the mapping problem , 1995, Comput. Oper. Res..
[20] Jean Roman,et al. SCOTCH: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs , 1996, HPCN Europe.
[21] Steven L. Scott,et al. The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus , 1996 .
[22] Eli Upfal,et al. Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems , 1997, IEEE Trans. Parallel Distributed Syst..
[23] Laxmikant V. Kalé,et al. Branch and Bound Based Load Balancing for Parallel Applications , 1999, ISCOPE.
[24] Viktor K. Prasanna,et al. Adaptive Communication Algorithms for Distributed Heterogeneous Systems , 1999, J. Parallel Distributed Comput..
[25] Hee Yong Youn,et al. Processor Scheduling and Allocation for 3D Torus Multicomputer Systems , 2000, IEEE Trans. Parallel Distributed Syst..
[26] Francisco Tirado,et al. Impact of PE Mapping on Cray T3E Message-Passing Performance , 2000, Euro-Par.
[27] Wu-chun Feng,et al. The Quadrics network (QsNet): high-performance clustering technology , 2001, HOT 9 Interconnects. Symposium on High Performance Interconnects.
[28] Allen D. Malony,et al. Performance Technology for Complex Parallel and Distributed Systems , 2001 .
[29] Robert D. Falgout,et al. hypre: A Library of High Performance Preconditioners , 2002, International Conference on Computational Science.
[30] Bronis R. de Supinski,et al. A Multilevel Approach to Topology-Aware Collective Operations in Computational Grids , 2002, ArXiv.
[31] José E. Moreira,et al. Job Scheduling for the BlueGene/L System (Research Note) , 2002, Euro-Par.
[32] José E. Moreira,et al. Job Scheduling for the BlueGene/L System , 2002, JSSPP.
[33] Vipin Kumar,et al. Parallel static and dynamic multi‐constraint graph partitioning , 2002, Concurr. Comput. Pract. Exp..
[34] Scott Pakin,et al. STORM: Lightning-Fast Resource Management , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[35] Rajeev Thakur,et al. Improving the Performance of Collective Operations in MPICH , 2003, PVM/MPI.
[36] Antonio Robles,et al. Supporting fully adaptive routing in InfiniBand networks , 2003, Proceedings International Parallel and Distributed Processing Symposium.
[37] Katherine Yelick,et al. UPC Language Specifications V1.1.1 , 2003 .
[38] Andy B. Yoo,et al. X-ray Pulse Compression Using Strained Crystals , 2002 .
[39] Craig A. Lee,et al. Topology-Aware Communication in Wide-Area Message-Passing , 2003, PVM/MPI.
[40] Dhabaleswar K. Panda,et al. Design and implementation of MPICH2 over InfiniBand with RDMA support , 2003, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[41] J.L. Traff. Hierarchical gather/scatter algorithms with graceful degradation , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[42] Marjan Gusev,et al. Improving Multilevel Approach for Optimizing Collective Communications in Computational Grids , 2005, EGC.
[43] James A. Kahle,et al. The Cell Processor Architecture , 2005, MICRO.
[44] P. Maechling,et al. Strong shaking in Los Angeles expected from southern San Andreas earthquake , 2006 .
[45] Rajeev Thakur,et al. Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..
[46] Amith R. Mamidala,et al. Efficient Shared Memory and RDMA Based Design for MPI_Allgather over InfiniBand , 2006, PVM/MPI.
[47] Reagan Moore,et al. Optimization and Scalability of a Large-scale Earthquake Simulation Application , 2006 .
[48] Jing Zhu,et al. Enabling Very-Large Scale Earthquake Simulations on Parallel Machines , 2007, International Conference on Computational Science.
[49] Dhabaleswar K. Panda,et al. Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study with I/OAT , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[50] Dhabaleswar K. Panda,et al. Efficient asynchronous memory copy operations on multi-core systems and I/OAT , 2007, 2007 IEEE International Conference on Cluster Computing.
[51] Cloyce D. Spradling. SPEC CPU2006 benchmark tools , 2007, CARN.
[52] Xin Yuan,et al. Bandwidth Efficient All-reduce Operation on Tree Topologies , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[53] Xin Yuan,et al. An MPI tool for automatically discovering the switch level topologies of Ethernet clusters , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[54] Amith R. Mamidala,et al. MPI Collectives on Modern Multicore Clusters: Performance Optimizations and Communication Characteristics , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[55] Torsten Hoefler,et al. Multistage switches are not crossbars: Effects of static routing in high-performance networks , 2008, 2008 IEEE International Conference on Cluster Computing.
[56] Amith R. Mamidala,et al. Scaling alltoall collective on multi-core systems , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[57] Torsten Hoefler,et al. Adaptive Routing Strategies for Modern High Performance Networks , 2008, 2008 16th IEEE Symposium on High Performance Interconnects.
[58] C. Walshaw. JOSTLE : parallel multilevel graph-partitioning software – an overview , 2008 .
[59] Philip Heidelberger,et al. Optimization of All-to-All Communication on the Blue Gene/L Supercomputer , 2008, 2008 37th International Conference on Parallel Processing.
[60] Galen M. Shipman,et al. MPI Support for Multi-core Architectures: Optimized Shared Memory Collectives , 2008, PVM/MPI.
[61] Dhabaleswar K. Panda,et al. Designing multi-leader-based Allgather algorithms for multi-core clusters , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[62] Dhabaleswar K. Panda,et al. RDMA over Ethernet — A preliminary study , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[63] Travis J. Wheeler,et al. Large-Scale Neighbor-Joining with NINJA , 2009, WABI.
[64] Torsten Hoefler,et al. The impact of network noise at large-scale communication performance , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[65] Thomas Hérault,et al. MPI Applications on Grids: A Topology Aware Approach , 2009, Euro-Par.
[66] Laxmikant V. Kalé,et al. An evaluative study on the effect of contention on message latencies in large supercomputers , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[67] G. Edward Suh,et al. Application-aware deadlock-free oblivious routing , 2009, ISCA '09.
[68] Thomas Hérault,et al. Running Parallel Applications with Topology-Aware Grid Middleware , 2009, 2009 Fifth IEEE International Conference on e-Science.
[69] Nicholas J. Wright,et al. Characterizing Parallel Scaling of Scientific Applications using IPM , 2009 .
[70] Amith R. Mamidala,et al. Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand , 2009, 2009 International Conference on Parallel Processing.
[71] Jing Zhu,et al. Toward petascale earthquake simulations , 2009 .
[72] M. Jette,et al. Simple Linux Utility for Resource Management , 2009 .
[73] L. Kalé,et al. A Pattern Language for Topology Aware Mapping , 2009 .
[74] Sayantan Sur,et al. Improving Application Performance and Predictability Using Multiple Virtual Lanes in Modern Multi-core InfiniBand Clusters , 2010, 2010 39th International Conference on Parallel Processing.
[75] Guillaume Mercier,et al. hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.
[76] Wolfgang E. Nagel,et al. VAMPIR: Visualization and Analysis of MPI Resources , 2010 .
[77] Sayantan Sur,et al. Quantifying performance benefits of overlap using MPI-2 in a seismic modeling application , 2010, ICS '10.
[78] Sayantan Sur,et al. Unifying UPC and MPI runtimes: experience with MVAPICH , 2010, PGAS '10.
[79] Dhabaleswar K. Panda,et al. Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: Case studies with Scatter and Gather , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[80] William Gropp,et al. A Scalable MPI_Comm_split Algorithm for Exascale Computing , 2010, EuroMPI.
[81] Emmanuel Jeannot,et al. Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures , 2010, Euro-Par.
[82] Sayantan Sur,et al. Design and Evaluation of Generalized Collective Communication Primitives with Overlap Using ConnectX-2 Offload Engine , 2010, 2010 18th IEEE Symposium on High Performance Interconnects.
[83] Laxmikant V. Kalé,et al. Automating Topology Aware Mapping for Supercomputers , 2010 .
[84] Michael Lang,et al. Optimized InfiniBand fat‐tree routing for shift all‐to‐all communication patterns , 2010, Concurr. Comput. Pract. Exp..
[85] J. C. Vassilicos,et al. A numerical strategy to combine high-order schemes, complex geometry and parallel computing for high resolution DNS of fractal generated turbulence , 2010 .
[86] Darren J. Kerbyson,et al. Optimized InfiniBand fat-tree routing for shift all-to-all communication patterns , 2010, ISC 2010.
[87] Matthias S. Müller,et al. SPEC MPI2007—an application benchmark suite for parallel systems using MPI , 2010, ISC 2010.
[88] Dhabaleswar K. Panda,et al. High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[89] Bernd Mohr,et al. The Scalasca performance toolset architecture , 2010, Concurr. Comput. Pract. Exp..
[90] Laxmikant V. Kalé,et al. Optimizing communication for Charm++ applications by reducing network contention , 2011, Concurr. Comput. Pract. Exp..
[91] Sayantan Sur,et al. Multi-threaded UPC runtime with network endpoints: Design alternatives and evaluation on multi-core architectures , 2011, 2011 18th International Conference on High Performance Computing.
[92] Jonathan Green,et al. Multi-core and Network Aware MPI Topology Functions , 2011, EuroMPI.
[93] Sayantan Sur,et al. Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters , 2011, 2011 IEEE International Conference on Cluster Computing.
[94] Xian-He Sun,et al. Layout-aware scientific computing: a case study using MILC , 2011, ScalA '11.
[95] Emmanuel Jeannot,et al. Improving MPI Applications Performance on Multicore Clusters with Rank Reordering , 2011, EuroMPI.
[96] Torsten Hoefler,et al. Generic topology mapping strategies for large-scale parallel architectures , 2011, ICS '11.
[97] Sriram Krishnamoorthy,et al. Noncollective Communicator Creation in MPI , 2011, EuroMPI.
[98] Vipin Chaudhary,et al. Rack aware scheduling in HPC data centers: an energy conservation strategy , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[99] Jose Sreeram,et al. UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters , 2011 .
[100] Dhabaleswar K. Panda,et al. Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).
[101] Robert D. Falgout,et al. Scaling Hypre's Multigrid Solvers to 100,000 Cores , 2011, High-Performance Scientific Computing.
[102] Dhabaleswar K. Panda,et al. Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework , 2012, 2012 IEEE International Conference on Cluster Computing.
[103] Dhabaleswar K. Panda,et al. Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation , 2012, 2012 41st International Conference on Parallel Processing.
[104] Eitan Zahavi. Fat-tree routing and node ordering providing contention free traffic for MPI global collectives , 2012, J. Parallel Distributed Comput..
[105] Dhabaleswar K. Panda,et al. Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[106] Dhabaleswar K. Panda,et al. Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[107] Dhabaleswar K. Panda,et al. Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[108] Yiannis Georgiou,et al. Evaluating Scalability and Efficiency of the Resource and Job Management System on Large HPC Clusters , 2012, JSSPP.
[109] Dhabaleswar K. Panda,et al. High performance RDMA-based design of HDFS over InfiniBand , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[110] D. Panda,et al. Extending OpenSHMEM for GPU Computing , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[111] Torsten Hoefler,et al. Bandwidth-optimal all-to-all exchanges in fat tree networks , 2013, ICS '13.
[112] Dhabaleswar K. Panda,et al. MIC-RO: enabling efficient remote offload on heterogeneous many integrated core (MIC) clusters with InfiniBand , 2013, ICS '13.
[113] B. W. Kernighan,et al. An Efficient Heuristic Procedure for Partitioning Graphs , 1970, Bell System Technical Journal.