Static Data Allocation and Load Balancing Techniques for Heterogeneous Systems

[1]  Michael Werman,et al.  The decomposition of a square into rectangles of minimal perimeter , 1987, Discret. Appl. Math..

[2]  Michael J. Quinn,et al.  Block data decomposition for data-parallel programming on a heterogeneous workstation network , 1993, Proceedings of the 2nd International Symposium on High Performance Distributed Computing.

[3]  James Demmel,et al.  ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance , 1995, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.

[4]  D. West Introduction to Graph Theory , 1995 .

[5]  Alexey L. Lastovetsky,et al.  Heterogeneous Distribution of Computations While Solving Linear Algebra Problems on Networks of Heterogeneous Computers , 1999, HPCN Europe.

[6]  Jaswinder Pal Singh,et al.  The effects of communication parameters on end performance of shared virtual memory clusters , 1997, SC '97.

[7]  Srinivasan Parthasarathy,et al.  Cashmere-2L: software coherent shared memory on a clustered remote-write network , 1997, SOSP.

[8]  Andrew S. Grimshaw,et al.  The Legion vision of a worldwide virtual computer , 1997, Commun. ACM.

[9]  Galen C. Hunt,et al.  VM-based Shared Memory on Low-latency, Remote-memory-access Networks , 1996, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[10]  Ricardo Bianchini,et al.  Lazy release consistency for hardware-coherent multiprocessors , 1995 .

[11]  Hochang Lee,et al.  A Hybrid Bounding Procedure for the Workload Allocation Problem on Parallel Unrelated Machines with Setups , 1996 .

[12]  Paul Hudak,et al.  Memory coherence in shared virtual memory systems , 1989, TOCS.

[13]  Liviu Iftode,et al.  Improving release-consistent shared virtual memory using automatic update , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[14]  James R. Larus,et al.  Implementing Fine-grain Distributed Shared Memory on Commodity SMP Workstations , 1996 .

[15]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[16]  Chelliah Sriskandarajah,et al.  Parallel machine scheduling with a common server , 2000, Discret. Appl. Math..

[17]  Liviu Iftode,et al.  Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems , 1996, OSDI '96.

[18]  Angelos Bilas,et al.  Shared Virtual Memory Clusters with Next-Generation Interconnection Networks and Wide Compute Nodes , 2001, HiPC.

[19]  Larry Carter,et al.  Determining the idle time of a tiling , 1997, POPL '97.

[20]  Prosenjit Bose,et al.  Cutting rectangles in equal area pieces , 1998, CCCG.

[21]  Jason Maassen,et al.  Parallel Computing on Wide-Area Clusters: the Albatross Project , 1999 .

[22]  Yuanyuan Zhou,et al.  Limits to the performance of software shared memory: a layered approach , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[23]  Greg J. Regnier,et al.  The Virtual Interface Architecture , 2002, IEEE Micro.

[24]  Hiroshi Ohta,et al.  Optimal tile size adjustment in compiling general DOACROSS loop nests , 1995, ICS '95.

[25]  Kai Li,et al.  Understanding Application Performance on Shared Virtual Memory Systems , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[26]  Donald Yeung,et al.  Multigrain shared memory , 2000, TOCS.

[27]  Ian T. Foster,et al.  Globus: a Metacomputing Infrastructure Toolkit , 1997, Int. J. High Perform. Comput. Appl..

[28]  Arnold L. Rosenberg,et al.  Sharing partitionable workloads in heterogeneous NOWs: greedier is not better , 2001, Proceedings 42nd IEEE Symposium on Foundations of Computer Science.

[29]  Yves Robert,et al.  A Proposal for a Heterogeneous Cluster ScaLAPACK (Dense Linear Solvers) , 2001, IEEE Trans. Computers.

[30]  Jaswinder Pal Singh,et al.  Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors , 1997, PPOPP '97.

[31]  Kunle Olukotun,et al.  The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[32]  Henri Casanova,et al.  NetSolve: A Network Server for Solving Computational Science Problems , 1996, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.

[33]  Kai Li,et al.  Design and implementation of virtual memory-mapped communication on Myrinet , 1997, Proceedings 11th International Parallel Processing Symposium.

[34]  Cezary Dubnicki,et al.  VMMC-2 : Efficient Support for Reliable, Connection-Oriented Communication , 1997 .

[35]  Michael L. Scott,et al.  Using memory-mapped network interfaces to improve the performance of distributed shared memory , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[36]  Kai Li,et al.  Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1994, ISCA '98.

[37]  Geoffrey C. Fox,et al.  Matrix algorithms on a hypercube I: Matrix multiplication , 1987, Parallel Comput..

[38]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[39]  Brian N. Bershad,et al.  Fast Interrupt Priority Management in Operating System Kernels , 1993, USENIX Microkernels and Other Kernel Architectures Symposium.

[40]  A. W. Roscoe,et al.  The Decomposition of a Rectangle into Rectangles of Minimal Perimeter , 1988, SIAM J. Comput..

[41]  Willy Zwaenepoel,et al.  Munin: distributed shared memory based on type-specific memory coherence , 1990, PPOPP '90.

[42]  Guy L. Steele,et al.  The High Performance Fortran Handbook , 1993 .

[43]  Seth Copen Goldstein,et al.  Active messages: a mechanism for integrating communication and computation , 1998, ISCA '98.

[44]  Stephen Taylor,et al.  A Practical Approach to Dynamic Load Balancing , 1998, IEEE Trans. Parallel Distributed Syst..

[45]  Peter Brucker,et al.  Complexity results for parallel machine problems with a single server , 2002 .

[46]  Thorsten von Eicken,et al.  U-Net: a user-level network interface for parallel and distributed computing , 1995, SOSP.

[47]  Svetlana A. Kravchenko,et al.  Parallel machine scheduling problems with a single server , 1997 .

[48]  Jaeyoung Choi,et al.  Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines , 1994, Sci. Program..

[49]  Kevin Skadron,et al.  Design issues and tradeoffs for write buffers , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[50]  Michel Minoux,et al.  Graphs and Algorithms , 1984 .

[51]  H. T. Kung,et al.  Path Planning On The Warp Computer: Using A Linear Systolic Array In Dynamic Programming , 1988, Optics & Photonics.

[52]  Alan L. Cox,et al.  TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.

[53]  Ramesh C. Agarwal,et al.  A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication , 1994, IBM J. Res. Dev..

[54]  Anant Agarwal,et al.  APRIL: a processor architecture for multiprocessing , 1990, ISCA '90.

[55]  Jack Dongarra,et al.  PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing , 1995 .

[56]  John L. Hennessy,et al.  SoftFLASH: analyzing the performance of clustered distributed virtual shared memory , 1996, ASPLOS VII.

[57]  Charles L. Seitz,et al.  Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.

[58]  Ricardo Bianchini,et al.  Hiding communication latency and coherence overhead in software DSMs , 1996, ASPLOS VII.

[59]  Jack Dongarra,et al.  MPI: The Complete Reference , 1996 .

[60]  Yves Robert,et al.  Matrix Multiplication on Heterogeneous Platforms , 2001, IEEE Trans. Parallel Distributed Syst..

[61]  Liviu Iftode,et al.  Relaxed consistency and coherence granularity in DSM systems: a performance evaluation , 1997, PPOPP '97.

[62]  Per Stenström,et al.  Performance evaluation of a cluster-based multiprocessor built from ATM switches and bus-based multiprocessor servers , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[63]  Guy Lemieux,et al.  The NUMAchine multiprocessor , 2000, Proceedings 2000 International Conference on Parallel Processing.