Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics

A critical factor in developing efficient shared-memory applications is cache usage and inter-thread communication. Inappropriate data structures, poor algorithm design, and inefficient thread affinity can cause superfluous communication between threads and cores, and thus severe performance problems. For this reason, state-of-the-art profiling tools analyze thread communication and behavior, presenting metrics that help programmers write cache-friendly programs. Data shared between a pair of threads should be reused within a reasonably short distance to preserve locality. Existing tools, however, do not take the locality of communication events into account and mainly analyze the amount of communication instead. In this paper, we introduce a new method to analyze performance and communication bottlenecks that arise from the data-access patterns and thread interactions of each code region. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations to a specific code region. We evaluated our approach on the SPLASH and Rodinia benchmark suites. Experimental results validate the effectiveness of our approach: it uncovers communication-locality issues caused by inefficient data structures and/or poor algorithm implementations. By applying the suggested optimizations, we improved the performance of the Rodinia benchmarks by up to 56%. Furthermore, by varying the input size, we demonstrated that our method can assess the cache usage and scalability of a given application in terms of its inherent communication.
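The notion of communication locality can be illustrated with a toy model. The sketch below is not the paper's tool: it assumes a hypothetical memory-access trace of (thread_id, address) events, flags a communication event whenever a thread touches an address last accessed by a different thread, and then measures the distance between successive communication events of each thread pair, a simple hardware-independent locality indicator in the spirit described above.

```python
from collections import defaultdict

def communication_events(trace):
    """Detect pairwise communication events in a shared-memory access trace.

    trace: list of (thread_id, address) tuples in program order.
    In this simplified model, a communication event occurs when a thread
    accesses an address last touched by a different thread.
    Returns a list of (time, pair), where pair is a frozenset of two threads.
    """
    last_accessor = {}  # address -> thread that touched it last
    events = []
    for t, (tid, addr) in enumerate(trace):
        prev = last_accessor.get(addr)
        if prev is not None and prev != tid:
            events.append((t, frozenset((tid, prev))))
        last_accessor[addr] = tid
    return events

def communication_reuse_distances(events):
    """For each thread pair, the distances between consecutive communication
    events on that pair. Small distances suggest the shared data is likely
    still cached when it is communicated again; large distances hint at
    poor communication locality."""
    last_seen = {}
    dists = defaultdict(list)
    for t, pair in events:
        if pair in last_seen:
            dists[pair].append(t - last_seen[pair])
        last_seen[pair] = t
    return dict(dists)

# Tiny illustrative trace: threads 0 and 1 ping-pong on addresses 'a' and 'b'.
trace = [(0, 'a'), (1, 'a'), (0, 'b'), (1, 'b'), (0, 'a')]
events = communication_events(trace)
distances = communication_reuse_distances(events)
```

A real tool would of course distinguish reads from writes, work at cache-line granularity, and attribute events to code regions; this sketch only conveys why the distance between communication events, not just their count, matters for cache behavior.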
