Parallel Performance Problems on Shared-Memory Multicore Systems: Taxonomy and Observation

The shift towards multicore processing means that a much wider population of developers now faces the challenge of exploiting parallel cores to improve software performance. Debugging and optimizing parallel programs is a complex and demanding task. Tools that support the development of parallel programs should provide salient information that allows programmers of multicore systems to diagnose performance problems and distinguish between them. Designing such tools appropriately requires a systematic analysis of the problems that might be identified and of the information used to diagnose them. Building on the literature, we put forward a potential taxonomy of parallel performance problems, together with an observational model that links measurable performance data to these problems. We present a validation of this model carried out with parallel programming experts, identifying areas of agreement and disagreement. This is accompanied by a survey of the prevalence of these problems in software development. From these results we identify contentious areas worthy of further exploration, as well as problems with high prevalence and strong agreement, which are natural candidates for initial moves towards better tool support.
