Characterizing communication and page usage of parallel applications for thread and data mapping

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  Shahid H. Bokhari,et al.  On the Mapping Problem , 1981, IEEE Transactions on Computers.

[3]  Anoop Gupta,et al.  Modeling communication in parallel algorithms: a fruitful interaction between theory and systems? , 1994, SPAA '94.

[4]  F. Pellegrini,et al.  Static mapping by dual recursive bipartitioning of process architecture graphs , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[5]  Chita R. Das,et al.  Towards a communication characterization methodology for parallel applications , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[6]  David J. Lilja,et al.  Characterization of Communication Patterns in Message-Passing Parallel Scientific Application Programs , 1998, CANPC.

[7]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput.

[8]  Ahmad Faraj,et al.  Communication Characteristics in the NAS Parallel Benchmarks , 2002, IASTED PDCS.

[9]  J. L. Träff.  Implementing the MPI Process Topology Mechanism , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[10]  William Gropp,et al.  MPICH2: A New Start for MPI Implementations , 2002, PVM/MPI.

[11]  R. Van der Wijngaart,et al.  NAS Parallel Benchmarks, Multi-Zone Versions , 2003.

[12]  George Bosilca,et al.  Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.

[13]  Jack Dongarra,et al.  Introduction to the HPCChallenge Benchmark Suite , 2004.

[14]  Zeshan Chishti,et al.  Optimizing replication, communication, and capacity allocation in CMPs , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[15]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[16]  Sverker Holmgren,et al.  affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system , 2005, ICS '05.

[17]  Rob H. Bisseling,et al.  Parallel hypergraph partitioning for scientific computing , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[18]  Frank Mueller,et al.  Hardware profile-guided automatic page placement for ccNUMA systems , 2006, PPoPP '06.

[19]  Guillaume Mercier,et al.  Implementation and Shared-Memory Evaluation of MPICH2 over the Nemesis Communication Subsystem , 2006, PVM/MPI.

[20]  Wenguang Chen,et al.  MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters , 2006, ICS '06.

[21]  Jean Roman,et al.  Exploiting Intensive Multithreading for the Efficient Simulation of 3D Seismic Wave Propagation , 2008, 2008 11th IEEE International Conference on Computational Science and Engineering.

[22]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[23]  Jeffrey K. Hollingsworth,et al.  Hardware monitors for dynamic page migration , 2008, J. Parallel Distributed Comput.

[24]  Jean-François Méhaut,et al.  Parallel simulations of seismic wave propagation on NUMA architectures , 2009, PARCO.

[25]  I. Lee,et al.  Characterizing communication patterns of NAS-MPI benchmark programs , 2009, IEEE Southeastcon 2009.

[26]  Michael Stumm,et al.  Enhancing operating system support for multicore processors by using hardware performance monitoring , 2009, OPSR.

[27]  Guillaume Mercier,et al.  Towards an Efficient Process Placement Policy for MPI Applications in Multicore Environments , 2009, PVM/MPI.

[28]  Jean-François Méhaut,et al.  Memory Affinity for Hierarchical Shared Memory Multiprocessors , 2009, 2009 21st International Symposium on Computer Architecture and High Performance Computing.

[29]  Simon W. Moore,et al.  A communication characterisation of Splash-2 and Parsec , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[30]  Brice Goglin,et al.  Enabling high-performance memory migration for multithreaded applications on LINUX , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[31]  Frank Mueller,et al.  Feedback-directed page placement for ccNUMA via hardware-generated memory traces , 2010, J. Parallel Distributed Comput.

[32]  Emmanuel Jeannot,et al.  Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures , 2010, Euro-Par.

[33]  David W. Nellans,et al.  Handling the problems and opportunities posed by multiple on-chip memory controllers , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[34]  Alexandra Fedorova,et al.  A case for NUMA-aware contention management on multicore systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[35]  Michael Ott,et al.  autopin - Automated Optimization of Thread-to-Core Pinning on Multicore Systems , 2011, Trans. High Perform. Embed. Archit. Compil.

[36]  Emmanuel Jeannot,et al.  Improving MPI Applications Performance on Multicore Clusters with Rank Reordering , 2011, EuroMPI.

[37]  Jeffrey M. Squyres,et al.  Locality-Aware Parallel Process Mapping for Multi-core HPC Systems , 2011, 2011 IEEE International Conference on Cluster Computing.

[38]  Torsten Hoefler,et al.  Generic topology mapping strategies for large-scale parallel architectures , 2011, ICS '11.

[39]  Andrew A. Chien,et al.  The future of microprocessors , 2011, Commun. ACM.

[40]  Philippe Olivier Alexandre Navaux,et al.  Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[41]  Yurii A. Vlasov,et al.  Technologies for exascale systems , 2011, IBM J. Res. Dev.

[42]  Vivien Quéma,et al.  MemProf: A Memory Profiler for NUMA Multicore Systems , 2012, USENIX Annual Technical Conference.

[43]  Thomas R. Gross,et al.  Matching memory access patterns and data placement for NUMA systems , 2012, CGO '12.

[44]  José Duato,et al.  Understanding Cache Hierarchy Contention in CMPs to Improve Job Scheduling , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[45]  Wei Wang,et al.  Performance analysis of thread mappings with a holistic view of the hardware resources , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[46]  Philippe Olivier Alexandre Navaux,et al.  Using the Translation Lookaside Buffer to Map Threads in Parallel Applications Based on Shared Memory , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[47]  Philippe Olivier Alexandre Navaux,et al.  Communication-Based Mapping Using Shared Pages , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[48]  Ragavendra Natarajan,et al.  Characterizing multi-threaded applications for designing sharing-aware last-level cache replacement policies , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[49]  Michael Frumkin,et al.  The OpenMP Implementation of NAS Parallel Benchmarks and its Performance , 2013.

[50]  Kenji Ono,et al.  Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments , 2013.

[51]  Vivien Quéma,et al.  Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.

[52]  Brice Goglin,et al.  KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework , 2013, J. Parallel Distributed Comput.

[53]  B. Brandfass,et al.  Rank reordering for MPI communication optimization , 2013.

[54]  Thomas R. Gross,et al.  (Mis)understanding the NUMA memory system performance of multithreaded workloads , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[55]  Philippe Olivier Alexandre Navaux,et al.  Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols , 2014, J. Parallel Distributed Comput..

[56]  Vivien Quéma,et al.  Large Pages May Be Harmful on NUMA Systems , 2014, USENIX Annual Technical Conference.