Generalization of decremental performance analysis to differential analysis

One of the most crucial steps in analyzing the performance of an application is detecting its bottlenecks. Since a bottleneck is any event that contributes to lengthening the execution time, identifying its causes matters to application developers, who need to understand flaws in design and in code generation. Bottleneck detection, however, is becoming a difficult art. In the past, techniques based on counting the number of events could find bottlenecks easily; today, the increased complexity of modern microarchitectures and the introduction of several levels of parallelism have made those techniques far less effective. There is therefore a real need for new approaches.

Our work focuses on developing performance evaluation tools for computational loops drawn from scientific applications. We work on Decan, a performance analysis tool built around an interesting and promising approach called decremental analysis. Decan rests on the idea of applying controlled changes to the program's loops and comparing the resulting version (called a variant) with the original one, thereby detecting the presence or absence of bottlenecks.

We first enriched Decan with new variants, which we designed, tested, and validated. These variants were then integrated into a deeper performance analysis called differential analysis. We also integrated the tool and the analysis into a broader performance analysis methodology called Pamda.

We further describe our various contributions to the Decan tool, detailing in particular the techniques for preserving the program's control structures and the addition of support for parallel programs. Finally, we carry out a statistical study that examines whether event counters other than execution time can be used as comparison metrics between Decan variants.
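To make the decremental principle concrete, here is a minimal source-level sketch of it. This is not how Decan itself operates: Decan patches the loop's instructions directly in the binary, which preserves the original instruction schedule far more faithfully. All names below are illustrative, and the "FP" variant (arithmetic kept, memory accesses suppressed) is only approximated here by holding operands in registers.

    /* Sketch: time an original loop against a transformed variant and
     * attribute the time difference to the removed instruction class. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (1L << 22)
    #define REPS 50

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Original loop: daxpy-like, one FP multiply and one FP add plus
     * two loads and a store per iteration. */
    static void original(double *restrict x, double *restrict y, double a)
    {
        for (long i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }

    /* "FP" variant: the same FP work per iteration, but operands held
     * in registers so the memory accesses disappear. */
    static double fp_variant(double a)
    {
        double xr = 1.0, yr = 2.0;
        for (long i = 0; i < N; i++)
            yr = a * yr + xr;   /* recurrence keeps the compiler honest */
        return yr;
    }

    int main(void)
    {
        double *x = malloc(N * sizeof *x);
        double *y = malloc(N * sizeof *y);
        for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        double t0 = now_sec();
        for (int r = 0; r < REPS; r++)
            original(x, y, 0.5);
        double t_orig = (now_sec() - t0) / REPS;

        volatile double sink = 0.0;   /* prevents dead-code elimination */
        t0 = now_sec();
        for (int r = 0; r < REPS; r++)
            sink += fp_variant(0.5);
        double t_fp = (now_sec() - t0) / REPS;

        /* If suppressing memory accesses collapses the run time, the loop
         * is memory bound; the gap estimates the memory bottleneck cost. */
        printf("original  : %8.3f ms  (checksum %g)\n", t_orig * 1e3, y[N / 2]);
        printf("FP variant: %8.3f ms\n", t_fp * 1e3);
        printf("time attributable to memory accesses: %.0f%%\n",
               100.0 * (t_orig - t_fp) / t_orig);
        free(x);
        free(y);
        return 0;
    }

Compiled with, for instance, gcc -O2, a run where the FP variant is much faster than the original points at the memory accesses as the dominant bottleneck; comparable times point at the FP units instead.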
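The closing statistical study asks whether hardware event counts can stand in for execution time as the metric compared across variants. The hedged sketch below shows one way such a measurement could be taken, using the PAPI library's preset-event API (compile with -lpapi); measure_region() is a hypothetical stand-in for the original loop or one of its Decan variants, not part of any real API.

    /* Sketch: read cycle and L1D-miss counts around a measured region. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    static volatile double sink;

    static void measure_region(void)   /* stand-in for the loop under study */
    {
        double s = 0.0;
        for (long i = 0; i < (1L << 22); i++)
            s += i * 0.5;
        sink = s;
    }

    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_L1_DCM };  /* cycles, L1D misses */
        long long counts[2];
        int evset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n");
            return EXIT_FAILURE;
        }
        PAPI_create_eventset(&evset);
        PAPI_add_events(evset, events, 2);

        PAPI_start(evset);
        measure_region();               /* run original loop or a variant */
        PAPI_stop(evset, counts);

        /* Comparing these counts between the original and a variant tells
         * us whether an event behaves stably enough to replace execution
         * time as the differential metric. */
        printf("cycles: %lld, L1D misses: %lld\n", counts[0], counts[1]);
        return 0;
    }

Whether such counts are in fact stable and comparable across variants is exactly what the statistical study evaluates.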
