Optimizing partitioned global address space programs for cluster architectures
[1] Katherine A. Yelick,et al. Titanium Performance and Potential: An NPB Experimental Study , 2005, LCPC.
[2] Monica S. Lam,et al. An affine partitioning algorithm to maximize parallelism and minimize communication , 1999, ICS '99.
[3] John M. Mellor-Crummey,et al. Effective communication coalescing for data-parallel applications , 2005, PPOPP.
[4] José Nelson Amaral,et al. Shared memory programming for large scale machines , 2006, PLDI '06.
[5] Edith Schonberg,et al. An HPF Compiler for the IBM SP2 , 1995, Proceedings of the IEEE/ACM SC95 Conference.
[6] Costin Iancu,et al. HUNTing the overlap , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).
[7] Laurie J. Hendren,et al. Locality Analysis for Parallel C Programs , 1999, IEEE Trans. Parallel Distributed Syst..
[8] D. Martin Swany,et al. Transformations to Parallel Codes for Communication-Computation Overlap , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[9] John A. Chandy,et al. The Paradigm Compiler for Distributed-Memory Multicomputers , 1995, Computer.
[10] Raymond Lo,et al. Strength Reduction via SSAPRE , 1998, CC.
[11] Barton P. Miller,et al. What are race conditions?: Some issues and formalizations , 1992, LOPL.
[12] T. von Eicken,et al. Parallel programming in Split-C , 1993, Supercomputing '93.
[13] Katherine A. Yelick,et al. Analyses and Optimizations for Shared Address Space Programs , 1996, J. Parallel Distributed Comput..
[14] Monica S. Lam,et al. Global optimizations for parallelism and locality on scalable parallel machines , 1993, PLDI '93.
[15] David J. Lilja,et al. Data prefetch mechanisms , 2000, CSUR.
[16] Lawrence Rauchwerger,et al. Automatic Detection of Parallelism: A grand challenge for high performance computing , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.
[17] Mahmut T. Kandemir,et al. A global communication optimization technique based on data-flow analysis and linear algebra , 1999, TOPL.
[18] Zhang Zhang,et al. Benchmark measurements of current UPC platforms , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[19] Jimmy Su,et al. Making Sequential Consistency Practical in Titanium , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[20] Monica S. Lam,et al. Maximizing Multiprocessor Performance with the SUIF Compiler , 1996, Digit. Tech. J..
[21] Liviu Iftode,et al. Shared virtual memory: progress and challenges , 1999 .
[22] Ken Kennedy,et al. Compiling Fortran D for MIMD distributed-memory machines , 1992, CACM.
[23] Xin Yuan,et al. Automatic generation and tuning of MPI collective communication routines , 2005, ICS '05.
[24] Xin Yuan,et al. STAR-MPI: self tuned adaptive routines for MPI collective operations , 2006, ICS '06.
[25] Ahmad Faraj,et al. Communication Characteristics in the NAS Parallel Benchmarks , 2002, IASTED PDCS.
[26] Robert W. Numrich,et al. Co-array Fortran for parallel programming , 1998, FORF.
[27] Ching-Hsien Hsu,et al. A Generalized Processor Mapping Technique for Array Redistribution , 2001, IEEE Trans. Parallel Distributed Syst..
[28] Rudolf Eigenmann,et al. Automatic program parallelization , 1993, Proc. IEEE.
[29] Chau-Wen Tseng,et al. Compiler optimizations for eliminating barrier synchronization , 1995, PPOPP '95.
[30] Katherine A. Yelick,et al. Titanium: A High-performance Java Dialect , 1998, Concurr. Pract. Exp..
[31] Anoop Gupta,et al. The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.
[32] David A. Padua,et al. Basic compiler algorithms for parallel programs , 1999, PPoPP '99.
[33] Katherine A. Yelick,et al. Polynomial-Time Algorithms for Enforcing Sequential Consistency in SPMD Programs with Arrays , 2003, LCPC.
[34] Tarek A. El-Ghazawi,et al. UPC Performance and Potential: A NPB Experimental Study , 2002, ACM/IEEE SC 2002 Conference (SC'02).
[35] Tarek S. Abdelrahman,et al. Computation-Communication Overlap on Network-of-Workstation Multiprocessors , 2001 .
[36] David H. Bailey,et al. The NAS Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..
[37] Alexander Aiken,et al. Type systems for distributed data structures , 2000, POPL '00.
[38] David A. Padua,et al. Concurrent Static Single Assignment Form and Constant Propagation for Explicitly Parallel Programs , 1997, LCPC.
[39] Dan Bonachea. Proposal for extending the UPC memory copy library functions and supporting extensions to GASNet , 2004 .
[40] Mahmut T. Kandemir,et al. Minimizing Data and Synchronization Costs in One-Way Communication , 2000, IEEE Trans. Parallel Distributed Syst..
[41] Katherine Yelick,et al. UPC Language Specifications V1.1.1 , 2003 .
[42] Erich Strohmaier,et al. Optimizing communication overlap for high-speed networks , 2007, PPoPP.
[43] Monica S. Lam,et al. Communication optimization and code generation for distributed memory machines , 1993, PLDI '93.
[44] Prithviraj Banerjee,et al. Advanced compilation techniques in the PARADIGM compiler for distributed-memory multicomputers , 1995, ICS '95.
[45] Michael Wolfe,et al. A New Approach to Array Redistribution: Strip Mining Redistribution , 1994, PARLE.
[46] Jason Duell. Pthreads or Processes: Which is Better for Implementing Global Address Space Languages? , 2007 .
[47] Martin Hirzel,et al. Dynamic hot data stream prefetching for general-purpose programs , 2002, PLDI '02.
[48] Jong-Deok Choi,et al. Global communication analysis and optimization , 1996, PLDI '96.
[49] Willy Zwaenepoel,et al. Implementation and performance of Munin , 1991, SOSP '91.
[50] Weiyu Chen. Building a Source-to-Source UPC-to-C Translator , 2004 .
[51] Bernard Tourancheau,et al. The Design for a High-Performance MPI Implementation on the Myrinet Network , 1999, PVM/MPI.
[52] Alan L. Cox,et al. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.
[53] Roland Westrelin,et al. Modeling of a high speed network to maximize throughput performance: the experience of BIP over Myrinet , 1997 .
[54] Katherine A. Yelick,et al. Concurrency Analysis for Parallel Programs with Textually Aligned Barriers , 2005, LCPC.
[55] Dan Bonachea. GASNet Specification, v1.1 , 2002 .
[56] Raymond Lo,et al. A new algorithm for partial redundancy elimination based on SSA form , 1997, PLDI '97.
[57] Jack Dongarra,et al. Introduction to the HPCChallenge Benchmark Suite , 2004 .
[58] Laurie J. Hendren,et al. Communication optimizations for parallel C programs , 1998, J. Parallel Distributed Comput..
[59] Katherine A. Yelick,et al. Optimizing bandwidth limited problems using one-sided communication and overlap , 2005, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[60] Katherine Yelick,et al. Titanium Language Reference Manual , 2001 .
[61] John M. Mellor-Crummey,et al. A Multi-Platform Co-Array Fortran Compiler , 2004, IEEE PACT.
[62] Jimmy Su,et al. Array prefetching for irregular array accesses in Titanium , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[63] Xin Yuan,et al. CC--MPI: a compiled communication capable MPI prototype for ethernet switched clusters , 2003, PPoPP '03.
[64] Monica S. Lam,et al. Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler , 1995, Proceedings of the IEEE/ACM SC95 Conference.
[65] Michael E. Wolf,et al. Combining Loop Transformations Considering Caches and Scheduling , 2004, International Journal of Parallel Programming.
[66] David Gay,et al. Barrier inference , 1998, POPL '98.
[67] Ken Kennedy,et al. Efficient address generation for block-cyclic distributions , 1995, ICS '95.
[68] Dhabaleswar K. Panda,et al. Protocols and strategies for optimizing performance of remote memory operations on clusters , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.
[69] Raymond Lo,et al. Register promotion by sparse partial redundancy elimination of loads and stores , 1998, PLDI.
[70] Monica S. Lam,et al. Retrospective: Software Pipelining: An Effective Scheduling Technique for VLIW Machines , 1998 .
[71] Geoffrey C. Fox,et al. A Compilation Approach for Fortran 90D/HPF Compilers on Distributed Memory MIMD Computers , 1993 .
[72] Raymond Lo,et al. Effective Representation of Aliases and Indirect Memory Operations in SSA Form , 1996, CC.
[73] Andrea C. Arpaci-Dusseau,et al. Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.
[74] Paul D. Gader,et al. Image algebra techniques for parallel image processing , 1987 .
[75] Yunheung Paek,et al. Efficient and precise array access analysis , 2002, TOPL.
[76] Victor Luchangco,et al. The Fortress Language Specification Version 1.0 , 2007 .
[77] Steven S. Muchnick,et al. Advanced Compiler Design and Implementation , 1997 .
[78] Tarek A. El-Ghazawi,et al. An evaluation of global address space languages: Co-Array Fortran and Unified Parallel C , 2005, PPoPP.
[79] Vivek Sarkar,et al. X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.
[80] John M. Mellor-Crummey,et al. Co-array Fortran Performance and Potential: An NPB Experimental Study , 2003, LCPC.
[81] Steve Sistare,et al. Optimization of MPI Collectives on Clusters of Large-Scale SMP's , 1999, SC.
[82] Jason Duell,et al. An evaluation of current high-performance networks , 2003, Proceedings International Parallel and Distributed Processing Symposium.
[83] Katherine A. Yelick,et al. A performance analysis of the Berkeley UPC compiler , 2003, ICS '03.
[84] Anoop Gupta,et al. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..
[85] Sathish S. Vadhiyar,et al. Automatically Tuned Collective Communications , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[86] S. Sistare,et al. Optimization of MPI Collectives on Clusters of Large-Scale SMPs , 1999, ACM/IEEE SC 1999 Conference (SC'99).
[87] William Pugh,et al. Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures , 2003, LCPC.
[88] R. K. Shyamasundar,et al. Introduction to algorithms , 1996 .
[89] Dennis Shasha,et al. Efficient and correct execution of parallel programs that share memory , 1988, TOPL.
[90] Todd C. Mowry,et al. Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.
[91] Toshio Nakatani,et al. Detection and global optimization of reduction operations for distributed parallel machines , 1996, ICS '96.
[92] Chris J. Scheiman,et al. LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation , 1997, J. Parallel Distributed Comput..
[93] Hans P. Zima,et al. The cascade high productivity language , 2004 .
[94] Manish Gupta,et al. PARADIGM: a compiler for automatic data distribution on multicomputers , 1993, ICS '93.
[95] C. Tseng,et al. UPC Implementation of an Unbalanced Tree Search Benchmark , 2003 .
[96] Ramesh Subramonian,et al. LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.
[97] Kourosh Gharachorloo,et al. Shasta: a low overhead, software-only approach for supporting fine-grain shared memory , 1996, ASPLOS VII.
[98] Allen,et al. Optimizing Compilers for Modern Architectures , 2004 .
[99] Katherine A. Yelick,et al. Type Systems for Distributed Data Sharing , 2003, SAS.
[100] Wei Chen,et al. Message Strip-Mining Heuristics for High Speed Networks , 2004, VECPAR.
[101] Michael Wolfe,et al. Effectiveness of Message Strip-mining for Regular and Irregular Communication , 1994 .