Large-Scale Program Behavior Analysis for Adaptation and Parallelization

Motivated by the relentless quest for program performance and energy savings, program execution environments (e.g., computer architectures and operating systems) are becoming reconfigurable and adaptive. Most programs are not: despite dramatic differences in inputs, machine configurations, and the workload of the underlying operating system, most programs always run the same code with the same data structures. The resulting mismatch between program and environment often leads to execution slowdown and resource under-utilization. The problem is exacerbated as chip multiprocessors become commonplace while most user programs remain sequential, are increasingly composed with library code, and run on interpreters and virtual machines. The ultimate goal of my research is an intelligent programming system that injects into a program the ability to automatically adapt and evolve its code and data and to configure its running environment, in order to achieve a better match between the (improved) program, its input, and the environment. Program adaptation is not possible without accurately forecasting a program’s behavior. However, traditional modular program design and analysis are ill-suited to finding large-scale composite patterns in increasingly complicated code, dynamically allocated data, and multi-layered execution environments (e.g., interpreters, virtual machines, operating systems, and computer architectures). My research views a program as a composition of large-scale behavior patterns, each of which may span a large number of loops and procedures statically and billions of instructions dynamically. I apply statistical techniques to automatically recognize the patterns, build models of program
