Network-Oblivious Algorithms

Designing algorithms that run unchanged yet efficiently on machines with different degrees of parallelism and different communication capabilities is a highly desirable goal. We propose a framework for network-obliviousness based on a model of computation where the only parameter is the problem's input size. Algorithms are then evaluated on a model with two parameters, capturing parallelism and granularity of communication. We show that, for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in a block variant of the decomposable BSP model, which effectively describes a wide and significant class of parallel platforms. We illustrate our framework by providing optimal network-oblivious algorithms for a few key problems, and also establish some negative results.
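As a rough illustration of the two-parameter evaluation idea, the following Python sketch counts the communication of a single synchronous step when an algorithm written for n virtual processors (n being the input size) is run on p physical processors that exchange data in blocks of B words. The function name, the contiguous assignment of virtual processors to physical ones, the requirement that p divides n, and the rule that words travelling between the same pair of processors are packed into blocks are all assumptions made for this sketch; the abstract does not spell out these details.

```python
from collections import defaultdict
from math import ceil


def folded_block_degree(messages, n, p, B):
    """Block-level communication degree of one step (illustrative only).

    messages: iterable of (src, dst, words) between virtual processors 0..n-1.
    Assumption: p divides n, and virtual processor i is hosted by
    physical processor i // (n // p).
    """
    if n % p != 0:
        raise ValueError("this sketch assumes p divides n")
    per_proc = n // p

    # Total words exchanged between each ordered pair of physical processors.
    words = defaultdict(int)
    for src, dst, w in messages:
        s, d = src // per_proc, dst // per_proc
        if s != d:  # traffic inside one physical processor is not counted
            words[(s, d)] += w

    # Pack each pair's words into blocks of B and tally per processor.
    sent = defaultdict(int)
    received = defaultdict(int)
    for (s, d), w in words.items():
        blocks = ceil(w / B)
        sent[s] += blocks
        received[d] += blocks

    # Degree = largest number of blocks any processor sends or receives.
    return max(list(sent.values()) + list(received.values()), default=0)


# Hypothetical pattern: each of 8 virtual processors sends one word
# to the virtual processor 4 positions away (cyclically).
shift = [(i, (i + 4) % 8, 1) for i in range(8)]
for p, B in [(2, 1), (2, 4), (4, 2), (8, 4)]:
    print(p, B, folded_block_degree(shift, n=8, p=p, B=B))
```

The same message pattern is written once, with no reference to p or B, and only the evaluation varies with the two parameters, which is the sense in which the algorithm is oblivious to the machine in this toy setting.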
