Computational power of pipelined memory hierarchies

We define a model of computation, the Pipelined Hierarchical Random Access Machine with access function a(x), denoted the a(x)-PH-RAM. In this model, a processor interacts with a memory that can accept requests at a constant rate and satisfies each request to location x within a(x) units of time. We investigate memory management strategies that lead to time-efficient implementations of arbitrary computations on a PH-RAM. We begin by developing the pipelined decomposition-tree memory management strategy, which can be tuned to the memory access function. Specifically, for a linear or sublinear access function a(x), we define the concept of latency-hiding depth d_a(x) and show how any computation of N operations can be implemented on an a(x)-PH-RAM in time T(N) = O(N d_a(N)). In particular, T(N) = O(N log N) if a(x) = O(x), T(N) = O(N log log N) if a(x) = O(x^β) with 0 < β < 1, and T(N) = O(N log* N) if a(x) = O(log x). We develop lower bound techniques that allow us to establish existential lower bounds on PH-RAMs. In particular, we exhibit computations for which T(N) = Ω(N log N / log log N) when a(x) = Ω(x), T(N) = Ω(N log log N) when a(x) = Ω(x^β) with 0 < β < 1, and T(N) = Ω(N log* N) when a(x) = Ω(log x).
The stated lower bounds show that the pipelined decomposition-tree strategy is existentially optimal in the latter two cases, but they indicate the potential for a modest, O(log log N) improvement for linear access functions. To realize this potential, a superpipelined decomposition-tree memory manager is proposed, which achieves T(N) = O(N log N / log log N). The pipelined decomposition-tree strategy can also be tuned to the computation, in order to exploit its temporal locality as characterized by the width parameters [9]. When the latter are suitably bounded, T(N) = O(N) on any PH-RAM with linear or sublinear access function. Finally, we discuss how performance could benefit from parallelism in the data-dependence dag of the computation or from architectural enhancements, such as block-transfer primitives, and formulate various questions that deserve further investigation.
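To give a concrete feel for how slowly the stated upper bounds grow, the sketch below evaluates the three growth terms N log N, N log log N, and N log* N for the three classes of access functions. This is an illustrative helper only, not an algorithm from the paper; the function names and the `access` labels are assumptions, and all constant factors hidden by the O-notation are dropped.

```python
import math

def log_star(n, base=2.0):
    # Iterated logarithm log* n: the number of times the logarithm
    # must be applied before the value drops to at most 1.
    count = 0
    while n > 1.0:
        n = math.log(n, base)
        count += 1
    return count

def time_bound(N, access):
    # Growth term of the upper bound T(N) achieved by the pipelined
    # decomposition-tree strategy, per the abstract (labels are
    # illustrative, not from the paper):
    #   'linear'      a(x) = O(x)      -> T(N) = O(N log N)
    #   'polynomial'  a(x) = O(x^beta) -> T(N) = O(N log log N)
    #   'logarithmic' a(x) = O(log x)  -> T(N) = O(N log* N)
    if access == 'linear':
        return N * math.log2(N)
    if access == 'polynomial':
        return N * math.log2(math.log2(N))
    if access == 'logarithmic':
        return N * log_star(N)
    raise ValueError(f"unknown access class: {access}")
```

Even for N = 2^16, log* N is only 4, which is why the N log* N bound for logarithmic access functions is nearly linear in practice.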

[1] Steven A. Przybylski et al. Cache and memory hierarchy design: a performance-directed approach, 1990.

[2] Charles E. Leiserson et al. Cache-Oblivious Algorithms, 2003, CIAC.

[3] Stephen A. Cook et al. Time-bounded random access machines, 1972, J. Comput. Syst. Sci.

[4] John E. Savage et al. Models of computation: exploring the power of computing, 1998.

[5] V. Milutinovic et al. Enhancing and Exploiting the Locality, 1999, IEEE Trans. Computers.

[6] Nancy M. Amato et al. Predicting performance on SMPs. A case study: the SGI Power Challenge, 2000, Proceedings 14th International Parallel and Distributed Processing Symposium (IPDPS 2000).

[7] Gianfranco Bilardi et al. A Characterization of Temporal Locality and Its Portability across Memory Hierarchies, 2001, ICALP.

[8] Alok Aggarwal et al. Hierarchical memory with block transfer, 1987, 28th Annual Symposium on Foundations of Computer Science (SFCS 1987).

[9] Michael Wolfe et al. High performance compilers for parallel computing, 1995.

[10] F. P. Preparata et al. Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part I, Upper Bounds, 1995, Theory of Computing Systems.

[11] Gianfranco Bilardi et al. An approach towards an analytical characterization of locality and its portability, 2001, Innovative Architecture for Future Generation High-Performance Processors and Systems.

[12] Irving L. Traiger et al. Evaluation Techniques for Storage Hierarchies, 1970, IBM Syst. J.

[13] Franco P. Preparata et al. Horizons of Parallel Computation, 1992, J. Parallel Distributed Comput.

[14] Andrea Pietracaprina et al. On the Space and Access Complexity of Computation DAGs, 2000, WG.

[15] Bowen Alpern et al. A model for hierarchical memory, 1987, STOC.

[16] Jeffrey Scott Vitter et al. External memory algorithms, 1998, ESA.

[17] John A. Fotheringham et al. Dynamic storage allocation in the Atlas computer, including an automatic use of a backing store, 1961, Commun. ACM.