Deconstructing the overhead in parallel applications

Performance problems in parallel programs manifest as a lack of scalability, and these scalability issues are often very difficult to debug. They can stem from synchronization overhead, poor thread-scheduling decisions, or contention for hardware resources such as shared caches. Traditional profiling tools attribute program cycles to individual functions, but they offer no immediate insight into the issues that limit scalability. Profiling information is highly program-specific and is usually processed manually by a human expert in a time-consuming and cumbersome process. Our experience tuning the performance of parallel applications led us to discover that performance tuning can be considerably simplified, and even partially automated, if profiling measurements are organized according to several intuitive performance factors common to most parallel programs. In this work we present these factors and propose a hierarchical framework that composes them. We describe three case studies in which analyzing profiling data according to the proposed principle led us to improve the performance of three parallel programs by factors of 6× to 20×. Our work lays the foundation for new ways of organizing and visualizing profiling data in performance-tuning tools.
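To make the idea concrete, the sketch below shows one plausible way profiling measurements could be rolled up into such a hierarchy of performance factors. It is a minimal illustration in Python: the Factor class, the factor names ("useful work", "synchronization", "scheduling", "resource contention"), the tree shape, and the cycle counts are all hypothetical assumptions for exposition, not the paper's actual taxonomy or data.

from dataclasses import dataclass, field

@dataclass
class Factor:
    name: str
    cycles: int = 0                        # cycles attributed directly to this factor
    children: list["Factor"] = field(default_factory=list)

    def total(self) -> int:
        # A node's total is its own cycles plus everything beneath it.
        return self.cycles + sum(c.total() for c in self.children)

    def report(self, grand_total: int, depth: int = 0) -> None:
        # Print this factor's share of all cycles, indented by depth.
        share = 100.0 * self.total() / grand_total
        print(f"{'  ' * depth}{self.name:<24}{share:5.1f}%")
        for c in self.children:
            c.report(grand_total, depth + 1)

# Hypothetical measurements for one run of a parallel program.
root = Factor("total", children=[
    Factor("useful work", cycles=620_000_000),
    Factor("parallelization overhead", children=[
        Factor("synchronization", cycles=210_000_000),      # lock/barrier waiting
        Factor("scheduling", cycles=40_000_000),            # idle cores, migrations
        Factor("resource contention", cycles=130_000_000),  # shared-cache conflicts
    ]),
])

root.report(root.total())

Reading the report top-down immediately shows whether lost cycles are dominated by synchronization, scheduling, or hardware contention, which is the kind of at-a-glance triage that a flat per-function profile does not provide.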
