Characterizing communication and page usage of parallel applications for thread and data mapping

Philippe Olivier Alexandre Navaux | Laércio Lima Pilla | Fabrice Dupros | Matthias Diener | Eduardo Henrique Molina da Cruz

[1] G. G. Stokes. "J.", 1890, The New Yale Book of Quotations.

[2] Shahid H. Bokhari, et al. On the Mapping Problem, 1981, IEEE Transactions on Computers.

[3] Anoop Gupta, et al. Modeling communication in parallel algorithms: a fruitful interaction between theory and systems?, 1994, SPAA '94.

[4] F. Pellegrini, et al. Static mapping by dual recursive bipartitioning of process architecture graphs, 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[5] Chita R. Das, et al. Towards a communication characterization methodology for parallel applications, 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[6] David J. Lilja, et al. Characterization of Communication Patterns in Message-Passing Parallel Scientific Application Programs, 1998, CANPC.

[7] Vipin Kumar, et al. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs, 1998, SIAM J. Sci. Comput.

[8] Ahmad Faraj, et al. Communication Characteristics in the NAS Parallel Benchmarks, 2002, IASTED PDCS.

[9] J. L. Traff. Implementing the MPI Process Topology Mechanism, 2002, ACM/IEEE SC 2002 Conference (SC'02).

[10] William Gropp, et al. MPICH2: A New Start for MPI Implementations, 2002, PVM/MPI.

[11] R. Vanderwijngaart, et al. NAS Parallel Benchmarks, Multi-Zone Versions, 2003.

[12] George Bosilca, et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, 2004, PVM/MPI.

[13] Jack Dongarra, et al. Introduction to the HPCChallenge Benchmark Suite, 2004.

[14] Zeshan Chishti, et al. Optimizing replication, communication, and capacity allocation in CMPs, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[15] Harish Patil, et al. Pin: building customized program analysis tools with dynamic instrumentation, 2005, PLDI '05.

[16] Sverker Holmgren, et al. affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system, 2005, ICS '05.

[17] Rob H. Bisseling, et al. Parallel hypergraph partitioning for scientific computing, 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[18] Frank Mueller, et al. Hardware profile-guided automatic page placement for ccNUMA systems, 2006, PPoPP '06.

[19] Guillaume Mercier, et al. Implementation and Shared-Memory Evaluation of MPICH2 over the Nemesis Communication Subsystem, 2006, PVM/MPI.

[20] Wenguang Chen, et al. MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters, 2006, ICS '06.

[21] Jean Roman, et al. Exploiting Intensive Multithreading for the Efficient Simulation of 3D Seismic Wave Propagation, 2008, 2008 11th IEEE International Conference on Computational Science and Engineering.

[22] Kai Li, et al. The PARSEC benchmark suite: Characterization and architectural implications, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[23] Jeffrey K. Hollingsworth, et al. Hardware monitors for dynamic page migration, 2008, J. Parallel Distributed Comput.

[24] Jean-François Méhaut, et al. Parallel simulations of seismic wave propagation on NUMA architectures, 2009, PARCO.

[25] I. Lee, et al. Characterizing communication patterns of NAS-MPI benchmark programs, 2009, IEEE Southeastcon 2009.

[26] Michael Stumm, et al. Enhancing operating system support for multicore processors by using hardware performance monitoring, 2009, OPSR.

[27] Guillaume Mercier, et al. Towards an Efficient Process Placement Policy for MPI Applications in Multicore Environments, 2009, PVM/MPI.

[28] Jean-François Méhaut, et al. Memory Affinity for Hierarchical Shared Memory Multiprocessors, 2009, 2009 21st International Symposium on Computer Architecture and High Performance Computing.

[29] Simon W. Moore, et al. A communication characterisation of Splash-2 and Parsec, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[30] Brice Goglin, et al. Enabling high-performance memory migration for multithreaded applications on LINUX, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[31] Frank Mueller, et al. Feedback-directed page placement for ccNUMA via hardware-generated memory traces, 2010, J. Parallel Distributed Comput.

[32] Emmanuel Jeannot, et al. Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures, 2010, Euro-Par.

[33] David W. Nellans, et al. Handling the problems and opportunities posed by multiple on-chip memory controllers, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[34] Alexandra Fedorova, et al. A case for NUMA-aware contention management on multicore systems, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[35] Michael Ott, et al. autopin - Automated Optimization of Thread-to-Core Pinning on Multicore Systems, 2011, Trans. High Perform. Embed. Archit. Compil.

[36] Emmanuel Jeannot, et al. Improving MPI Applications Performance on Multicore Clusters with Rank Reordering, 2011, EuroMPI.

[37] Jeffrey M. Squyres, et al. Locality-Aware Parallel Process Mapping for Multi-core HPC Systems, 2011, 2011 IEEE International Conference on Cluster Computing.

[38] Torsten Hoefler, et al. Generic topology mapping strategies for large-scale parallel architectures, 2011, ICS '11.

[39] Andrew A. Chien, et al. The future of microprocessors, 2011, Commun. ACM.

[40] Philippe Olivier Alexandre Navaux, et al. Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms, 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[41] Yurii A. Vlasov, et al. Technologies for exascale systems, 2011, IBM J. Res. Dev.

[42] Vivien Quéma, et al. MemProf: A Memory Profiler for NUMA Multicore Systems, 2012, USENIX Annual Technical Conference.

[43] Thomas R. Gross, et al. Matching memory access patterns and data placement for NUMA systems, 2012, CGO '12.

[44] José Duato, et al. Understanding Cache Hierarchy Contention in CMPs to Improve Job Scheduling, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[45] Wei Wang, et al. Performance analysis of thread mappings with a holistic view of the hardware resources, 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[46] Philippe Olivier Alexandre Navaux, et al. Using the Translation Lookaside Buffer to Map Threads in Parallel Applications Based on Shared Memory, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[47] Philippe Olivier Alexandre Navaux, et al. Communication-Based Mapping Using Shared Pages, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[48] Ragavendra Natarajan, et al. Characterizing multi-threaded applications for designing sharing-aware last-level cache replacement policies, 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[49] Michael Frumkin, et al. The OpenMP Implementation of NAS Parallel Benchmarks and its Performance, 2013.

[50] Kenji Ono, et al. Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments, 2013.

[51] Vivien Quéma, et al. Traffic management: a holistic approach to memory placement on NUMA systems, 2013, ASPLOS '13.

[52] Brice Goglin, et al. KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework, 2013, J. Parallel Distributed Comput.

[53] B. Brandfass, et al. Rank reordering for MPI communication optimization, 2013.

[54] Thomas R. Gross, et al. (Mis)understanding the NUMA memory system performance of multithreaded workloads, 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[55] Philippe Olivier Alexandre Navaux, et al. Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols, 2014, J. Parallel Distributed Comput.

[56] Vivien Quéma, et al. Large Pages May Be Harmful on NUMA Systems, 2014, USENIX Annual Technical Conference.