Large-Scale Program Behavior Analysis for Adaptation and Parallelization
[1] Kristof Beyls,et al. Reuse Distance as a Metric for Cache Behavior , 2001 .
[2] Paul Feautrier,et al. Direct parallelization of call statements , 1986, SIGPLAN '86.
[3] Alan P. Batson,et al. Measurements of major locality phases in symbolic reference strings , 1976, SIGMETRICS '76.
[4] Diego R. Llanos Ferraris,et al. Design space exploration of a software speculative parallelization scheme , 2005, IEEE Transactions on Parallel and Distributed Systems.
[5] Alan L. Cox,et al. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.
[6] Trishul M. Chilimbi. Efficient representations and abstractions for quantifying and exploiting data reference locality , 2001, PLDI '01.
[7] J. Eliot B. Moss. Open Nested Transactions: Semantics and Support , 2006 .
[8] Brad Calder,et al. Selecting software phase markers with code structure analysis , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[9] Mateo Valero,et al. Multiple-banked register file architectures , 2000, Proceedings of 27th International Symposium on Computer Architecture.
[10] Yutao Zhong,et al. Predicting whole-program locality through reuse distance analysis , 2003, PLDI '03.
[11] James R. Larus,et al. Cache-conscious structure definition , 1999, PLDI '99.
[12] Todd M. Austin,et al. The SimpleScalar tool set, version 2.0 , 1997, CARN.
[13] N. Ranganathan,et al. A wire-delay scalable microprocessor architecture for high performance systems , 2003, 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC.
[14] J. T. Robinson,et al. On optimistic methods for concurrency control , 1979, TODS.
[15] David H. Bailey. Unfavorable Strides in Cache Memory Systems (RNR Technical Report RNR-92-015) , 1995, Sci. Program..
[16] Jeffrey D. Ullman,et al. Introduction to Automata Theory, Languages and Computation , 1979 .
[17] Monica S. Lam,et al. The design, implementation, and evaluation of Jade , 1998, TOPL.
[18] Steve Carr,et al. Instruction based memory distance analysis and its application to optimization , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).
[19] Lawrence Rauchwerger,et al. The R-LRPD test: speculative parallelization of partially parallel loops , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.
[20] Zhiyuan Li,et al. An Efficient Data Dependence Analysis for Parallelizing Compilers , 1990, IEEE Trans. Parallel Distributed Syst..
[21] James E. Smith,et al. Managing multi-configuration hardware via dynamic working set analysis , 2002, ISCA.
[22] Rajeev Balasubramonian,et al. Dynamically managing the communication-parallelism trade-off in future clustered processors , 2003, ISCA '03.
[23] Joel H. Saltz,et al. The Design and Implementation of a Parallel Unstructured Euler Solver Using Software Primitives (ICASE Report No. 92-12) , 1992 .
[24] Joel H. Saltz,et al. Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures , 1994, J. Parallel Distributed Comput..
[25] Alan Jay Smith,et al. A Comparative Study of Set Associative Memory Mapping Algorithms and Their Use for Cache and Main Memory , 1978, IEEE Transactions on Software Engineering.
[26] Yan Solihin,et al. Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.
[27] Brad Calder,et al. Basic block distribution analysis to find periodic behavior and simulation points in applications , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.
[28] Ulrich Kremer,et al. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction , 2003, PLDI '03.
[29] Michael A. Harrison,et al. Accurate static estimators for program optimization , 1994, PLDI '94.
[30] L. Rauchwerger,et al. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization , 1999, IEEE Trans. Parallel Distributed Syst..
[31] Chen Ding,et al. Phase-Based Miss Rate Prediction Across Program Inputs , 2004, LCPC.
[32] Peter J. Denning,et al. Working Sets Past and Present , 1980, IEEE Transactions on Software Engineering.
[33] Chen Ding,et al. A Hierarchical Model of Reference Affinity , 2003, LCPC.
[34] Vivek Sarkar,et al. Determining average program execution times and their variance , 1989, PLDI '89.
[35] Ken Kennedy,et al. Improving the ratio of memory operations to floating-point operations in loops , 1994, TOPL.
[36] Ken Kennedy,et al. A static performance estimator to guide data partitioning decisions , 1991, PPOPP '91.
[37] Chen Ding,et al. Gated memory control for memory monitoring, leak detection and garbage collection , 2005, MSP '05.
[38] Ken Kennedy,et al. An Implementation of Interprocedural Bounded Regular Section Analysis , 1991, IEEE Trans. Parallel Distributed Syst..
[39] Thomas D. Burd,et al. Energy efficient CMOS microprocessor design , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.
[40] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1990 .
[41] Santosh G. Abraham,et al. Efficient simulation of caches under optimal replacement with applications to miss characterization , 1993, SIGMETRICS '93.
[42] Zhen Fang,et al. The Impulse Memory Controller , 2001, IEEE Trans. Computers.
[43] Donald E. Knuth,et al. An empirical study of FORTRAN programs , 1971, Softw. Pract. Exp..
[44] Kunle Olukotun,et al. The Jrpm system for dynamically parallelizing Java programs , 2003, ISCA '03.
[45] Susan J. Eggers,et al. Reducing false sharing on shared memory multiprocessors through compile time data transformations , 1995, PPOPP '95.
[46] Krishna V. Palem,et al. Data remapping for design space optimization of embedded memory systems , 2003, TECS.
[47] Michael L. Scott,et al. Profile-based dynamic voltage and frequency scaling for a multiple clock domain microprocessor , 2003, ISCA '03.
[48] Trista Pei-chun Chen,et al. Computer Vision Workload Analysis: Case Study of Video Surveillance Systems , 2005 .
[49] Rajeev Balasubramonian,et al. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures , 2000, MICRO 33.
[50] Song Jiang,et al. LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance , 2002, SIGMETRICS '02.
[51] T. Mowry,et al. Memory forwarding: enabling aggressive layout optimizations by guaranteeing the safety of data relocation , 1999, Proceedings of the 26th International Symposium on Computer Architecture.
[52] L. Spaanenburg,et al. Design space exploration for a DT-CNN , 2008, 2008 11th International Workshop on Cellular Neural Networks and Their Applications.
[53] Chandra Krintz,et al. Cache-conscious data placement , 1998, ASPLOS VIII.
[54] Wei Liu,et al. EXPERT: expedited simulation exploiting program behavior repetition , 2004, ICS '04.
[55] Chen Ding,et al. Miss Rate Prediction Across Program Inputs and Cache Configurations , 2007, IEEE Transactions on Computers.
[56] Vivek Sarkar,et al. Array SSA form and its use in parallelization , 1998, POPL '98.
[57] Lizy Kurian John,et al. Scaling to the end of silicon with EDGE architectures , 2004, Computer.
[58] Hwansoo Han,et al. Locality Optimizations For Adaptive Irregular Scientific Codes , 2000 .
[59] Chen Ding,et al. Lightweight reference affinity analysis , 2005, ICS '05.
[60] Monica S. Lam,et al. Data and computation transformations for multiprocessors , 1995, PPOPP '95.
[61] Laurie J. Hendren,et al. Context-sensitive interprocedural points-to analysis in the presence of function pointers , 1994, PLDI '94.
[62] Maurice Herlihy,et al. Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.
[63] Tarek S. Abdelrahman,et al. Run-Time Support for the Automatic Parallelization of Java Programs , 2004, The Journal of Supercomputing.
[64] Ken Kennedy,et al. Inter-array Data Regrouping , 1999, LCPC.
[65] Kristof Beyls,et al. Reuse Distance-Based Cache Hint Selection , 2002, Euro-Par.
[66] Ian H. Witten,et al. Identifying Hierarchical Structure in Sequences: A linear-time algorithm , 1997, J. Artif. Intell. Res..
[67] John M. Mellor-Crummey,et al. Cross-architecture performance predictions for scientific applications using parameterized models , 2004, SIGMETRICS '04/Performance '04.
[68] James R. Larus,et al. Cache-conscious structure layout , 1999, PLDI '99.
[69] Chandra Krintz,et al. Online phase detection algorithms , 2006, International Symposium on Code Generation and Optimization (CGO'06).
[70] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.
[71] Antonia Zhai,et al. The STAMPede approach to thread-level speculation , 2005, TOCS.
[72] Virendra J. Marathe,et al. A Qualitative Survey of Modern Software Transactional Memory Systems , 2004 .
[73] Ron Cytron,et al. Interprocedural dependence analysis and parallelization , 1986, SIGP.
[74] Brad Calder,et al. Phase tracking and prediction , 2003, ISCA '03.
[75] Todd C. Mowry,et al. Memory forwarding: enabling aggressive layout optimizations by guaranteeing the safety of data relocation , 1999, ISCA.
[76] R. Sarnath,et al. Proceedings of the International Conference on Parallel Processing , 1992 .
[77] Chen Ding,et al. Characterizing Phases in Service-Oriented Applications , 2004 .
[78] Ken Kennedy,et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .
[79] David A. Padua,et al. Calculating stack distances efficiently , 2002, MSP/ISMM.
[80] Chau-Wen Tseng,et al. Improving Locality for Adaptive Irregular Scientific Codes , 2000, LCPC.
[81] Xipeng Shen,et al. Adaptive data partition for sorting using probability distribution , 2004 .
[82] Ken Kennedy,et al. Estimating Interlock and Improving Balance for Pipelined Architectures , 1988, J. Parallel Distributed Comput..
[83] Margaret Martonosi,et al. Wavelet analysis for microprocessor design: experiences with wavelet-based dI/dt characterization , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).
[84] Dimitri J. Mavriplis,et al. Design and implementation of a parallel unstructured Euler solver using software primitives , 1994 .
[85] John M. Mellor-Crummey,et al. Compile-time support for efficient data race detection in shared-memory parallel programs , 1993, PADD '93.
[86] Michael C. Huang,et al. Positional adaptation of processors: application to energy reduction , 2003, ISCA '03.
[87] James E. Smith,et al. Comparing program phase detection techniques , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.
[88] Guang R. Gao,et al. Designing the McCAT Compiler Based on a Family of Structured Intermediate Representations , 1992, LCPC.
[89] J. O. Rawlings,et al. Applied Regression Analysis: A Research Tool , 1988 .
[90] Larry Carter,et al. Compile-time composition of run-time data and iteration reorderings , 2003, PLDI '03.
[91] Manish Gupta,et al. Techniques for Speculative Run-Time Parallelization of Loops , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[92] Ken Kennedy,et al. Improving effective bandwidth through compiler enhancement of global cache reuse , 2004, J. Parallel Distributed Comput..
[93] John Paul Shen,et al. The intrinsic bandwidth requirements of ordinary programs , 1996, ASPLOS VII.
[94] Markus Mock,et al. A retrospective on: "an evaluation of staged run-time optimizations in DyC" , 2004, SIGP.
[95] Nir Shavit,et al. Software transactional memory , 1995, PODC '95.
[96] Mikko H. Lipasti,et al. Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.
[97] Ken Kennedy,et al. Improving effective bandwidth through compiler enhancement of global and dynamic cache reuse , 2000 .
[98] Chau-Wen Tseng,et al. Data transformations for eliminating conflict misses , 1998, PLDI.
[99] John Cocke,et al. A program data flow analysis procedure , 1976, CACM.
[100] J. Larus. Whole program paths , 1999, PLDI '99.
[101] Matthew L. Seidl,et al. Segregating heap objects by reference behavior and lifetime , 1998, ASPLOS VIII.
[102] Pedro C. Diniz,et al. A compiler approach to fast hardware design space exploration in FPGA-based systems , 2002, PLDI '02.
[103] Mark N. Wegman,et al. Efficiently computing static single assignment form and the control dependence graph , 1991, TOPL.
[104] Alan Eustace,et al. ATOM - A System for Building Customized Program Analysis Tools , 1994, PLDI.
[105] Herwig W. Kressler. Evaluation of Potential , 2003 .
[106] Chen Ding,et al. Parallelization of Utility Programs Based on Behavior Phase Analysis , 2005, LCPC.
[107] Michael Voss,et al. High-level adaptive program optimization with ADAPT , 2001, PPoPP '01.
[108] Ingrid Daubechies,et al. Ten Lectures on Wavelets , 1992 .
[109] Michael D. Smith,et al. Procedure placement using temporal-ordering information , 1999, TOPL.
[110] Chen Ding,et al. Program-level adaptive memory management , 2006, ISMM '06.
[111] Maurice Herlihy,et al. Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.
[112] David H. Bailey. Unfavorable strides in cache memory systems , 1992 .
[113] Lawrence Rauchwerger,et al. The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization , 1995, PLDI '95.
[114] Larry Carter,et al. Localizing non-affine array references , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques.
[115] Matthew Arnold,et al. A framework for reducing the cost of instrumented code , 2001, PLDI '01.
[116] Mark N. Wegman,et al. Constant propagation with conditional branches , 1985, POPL.
[117] Chau-Wen Tseng,et al. Improving data locality with loop transformations , 1996, TOPL.
[118] David Padua,et al. Compile-time performance prediction of scientific programs , 2000 .
[119] Yuefan Deng,et al. New trends in high performance computing , 2001, Parallel Computing.
[120] Joel H. Saltz,et al. Run-Time Parallelization and Scheduling of Loops , 1991, IEEE Trans. Computers.
[121] Matteo Frigo,et al. The implementation of the Cilk-5 multithreaded language , 1998, PLDI.
[122] Kai Li,et al. Shared virtual memory on loosely coupled multiprocessors , 1986 .
[123] Ken Kennedy,et al. Improving memory hierarchy performance for irregular applications , 1999, ICS '99.
[124] Chen Ding,et al. Locality phase prediction , 2004, ASPLOS XI.
[125] Irving L. Traiger,et al. Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..
[126] Maurice Herlihy,et al. Software transactional memory for dynamic-sized data structures , 2003, PODC '03.
[127] Chen Ding,et al. Array regrouping and structure splitting using whole-program reference affinity , 2004, PLDI '04.
[128] Ron Cytron. Doacross: Beyond Vectorization for Multiprocessors , 1986, ICPP.
[129] Olivier Temam,et al. Quantifying loop nest locality using SPEC'95 and the perfect benchmarks , 1999, TOCS.
[130] Milind Girkar,et al. On the performance potential of different types of speculative thread-level parallelism , 2006, ICS '06.
[131] Sandhya Dwarkadas,et al. Characterizing and predicting program behavior and its variability , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.
[132] Chen Ding,et al. Distance-based whole-program data locality hierarchy , 2005 .
[133] Chen Ding,et al. Miss rate prediction across all program inputs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.
[134] Ken Kennedy,et al. Automatic data layout for distributed-memory machines , 1998, TOPL.
[135] Steve Carr,et al. Reuse-distance-based miss-rate prediction on a per instruction basis , 2004, MSP '04.
[136] Robert Wahbe,et al. Practical data breakpoints: design and implementation , 1993, PLDI '93.
[137] Sally A. McKee,et al. Restructuring Computations for Temporal Data Cache Locality , 2003, International Journal of Parallel Programming.
[138] Christos H. Papadimitriou,et al. The serializability of concurrent database updates , 1979, JACM.
[139] Hans-Juergen Boehm,et al. HP Laboratories , 2006 .
[140] Robert H. Halstead,et al. MULTILISP: a language for concurrent symbolic computation , 1985, TOPL.
[141] Keith D. Cooper,et al. Engineering a Compiler , 2003 .
[142] Yuanyuan Zhou,et al. The Multi-Queue Replacement Algorithm for Second Level Buffer Caches , 2001, USENIX Annual Technical Conference, General Track.
[143] Khalid Omar Thabit,et al. Cache management by the compiler , 1982 .
[144] Ken Kennedy,et al. Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999, PLDI '99.
[145] Peter J. Keleher,et al. A Protocol-Centric Approach to on-the-Fly Race Detection , 2000, IEEE Trans. Parallel Distributed Syst..
[146] Chen Ding,et al. Regression-Based Multi-Model Prediction of Data Reuse Signature , 2003 .
[147] Lieven Eeckhout,et al. Method-level phase behavior in java workloads , 2004, OOPSLA.
[148] Kathryn S. McKinley. Polar opposites: next generation languages and architectures , 2004, MSP '04.