Performance evaluation of applications for heterogeneous systems by means of performance probes

This doctoral thesis describes a novel way to select the best node out of a pool of potentially heterogeneous computing nodes for the execution of computational tasks. This is a fundamental and difficult problem of computer science, and computing centres have tried to get around it by using only homogeneous compute clusters. This usually fails because, like any technical equipment, clusters get extended, adapted or repaired over time, and one ends up with a heterogeneous configuration. So far, the solutions have been:

• to leave it to the users to choose the right node(s) for execution, or
• to make extensive tests by executing and measuring all tasks on every type of computing node available in the pool.

In the typical case, where a large number of tasks would need to be tested on many different types of nodes, the latter can consume a lot of computing resources, sometimes even more than the actual execution one wants to optimize. In a specific setting (hierarchical multi-clusters) the situation is worse: the configuration of the cluster changes over time, so the execution tests would have to be repeated every time the configuration changes.

I developed a novel and elegant solution to this problem, named "Performance Probe", or just "Probe" for short. A probe is a stripped-down version of a computational task that retains all important characteristics of the original, but can be executed in a much shorter time (seconds instead of hours) and is much smaller than the original task (about 5% of the original size in the worst cases), while still allowing the execution time of the original to be predicted within reasonable bounds (around 90% accuracy).

These results are very important: as scheduling is a basic problem of computer science, they can be used not only in the setting described by the thesis (selecting the right compute node for tasks in a hierarchical multi-cluster), but also in many different contexts where scheduling and/or selection decisions have to be made: deciding where a computational task would run most efficiently (which cluster at which centre), picking the right execution nodes in a large complex (grid, cloud), workflows, and many more.
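The thesis text itself contains no code, but the selection loop it describes can be sketched. The following is a minimal illustration under stated assumptions, not the author's implementation: the probe executable (./task_probe), the ssh-based launch, and the per-application scale_factor (which would be calibrated once by running both the probe and the full task on a single reference node) are all hypothetical placeholders.

```python
import subprocess
import time

def measure_probe(node: str, probe_cmd: list[str]) -> float:
    """Run the probe on a node and return its wall-clock time in seconds.

    Plain ssh is used here only for illustration; a real deployment would
    go through the cluster's job launcher.
    """
    start = time.perf_counter()
    subprocess.run(["ssh", node, *probe_cmd], check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start

def predict_full_runtime(probe_time: float, scale_factor: float) -> float:
    """Scale the probe's runtime up to an estimate for the full task.

    scale_factor is a hypothetical per-application calibration constant,
    e.g. full-task runtime divided by probe runtime on a reference node.
    """
    return probe_time * scale_factor

def select_best_node(nodes: list[str], probe_cmd: list[str],
                     scale_factor: float) -> str:
    """Pick the node with the lowest predicted full-task execution time."""
    predictions = {
        node: predict_full_runtime(measure_probe(node, probe_cmd),
                                   scale_factor)
        for node in nodes
    }
    return min(predictions, key=predictions.get)

if __name__ == "__main__":
    # Hypothetical pool of heterogeneous nodes and probe binary.
    pool = ["node-a", "node-b", "node-c"]
    best = select_best_node(pool, ["./task_probe"], scale_factor=480.0)
    print(f"Run the full task on: {best}")
```

In this sketch the probe's seconds-long run on each node stands in for the hours-long full execution, so the whole selection costs only one short probe run per node rather than one full execution per node type.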
