We define a model of computation, called the <i>Pipelined Hierarchical Random Access Machine</i> with <i>access function a</i>(<i>x</i>), denoted the <i>a</i>(<i>x</i>)-PH-RAM. In this model, a processor interacts with a memory that can accept requests at a constant rate and satisfies each request to location <i>x</i> within <i>a</i>(<i>x</i>) units of time.
We investigate memory management strategies that lead to time-efficient implementations of arbitrary computations on a PH-RAM. We begin by developing the so-called <i>pipelined decomposition-tree</i> memory management strategy, which can be tuned to the memory access function. Specifically, for a linear or sublinear access function <i>a</i>(<i>x</i>), we define the concept of <i>latency-hiding depth d<sub>a</sub></i>(<i>x</i>) and show how any computation of <i>N</i> operations can be implemented on an <i>a</i>(<i>x</i>)-PH-RAM in time <i>T</i>(<i>N</i>) = <i>O</i>(<i>Nd<sub>a</sub></i>(<i>N</i>)). In particular, <i>T</i>(<i>N</i>) = <i>O</i>(<i>N</i> log <i>N</i>) if <i>a</i>(<i>x</i>) = <i>O</i>(<i>x</i>), <i>T</i>(<i>N</i>) = <i>O</i>(<i>N</i> log log <i>N</i>) if <i>a</i>(<i>x</i>) = <i>O</i>(<i>x</i><sup>β</sup>) with 0 < β < 1, and <i>T</i>(<i>N</i>) = <i>O</i>(<i>N</i> log* <i>N</i>) if <i>a</i>(<i>x</i>) = <i>O</i>(log <i>x</i>).
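To give a concrete sense of how the three stated upper bounds separate, the following sketch tabulates <i>N</i> log <i>N</i>, <i>N</i> log log <i>N</i>, and <i>N</i> log* <i>N</i> for a given <i>N</i>, using the standard iterated logarithm for log*. The function names and the dictionary layout are illustrative only; they are not taken from the paper.

```python
import math

def log_star(x: float) -> int:
    """Iterated logarithm log* x: the number of times log2 must be
    applied before the value drops to at most 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

def upper_bounds(n: int) -> dict:
    """Tabulate the three upper bounds T(N) stated in the abstract
    (ignoring constant factors) for the three access-function regimes."""
    return {
        "a(x)=O(x)":      n * math.log2(n),                # T(N) = O(N log N)
        "a(x)=O(x^beta)": n * math.log2(math.log2(n)),     # T(N) = O(N log log N)
        "a(x)=O(log x)":  n * log_star(n),                 # T(N) = O(N log* N)
    }
```

For example, at <i>N</i> = 2<sup>64</sup> the three bounds scale (up to constants) as 64<i>N</i>, 6<i>N</i>, and 5<i>N</i>, respectively, illustrating how slowly log* grows compared with log log.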
We develop lower bound techniques that allow us to establish existential lower bounds on PH-RAMs. In particular, we exhibit computations for which <i>T</i>(<i>N</i>) = Ω(<i>N</i> log <i>N</i>/log log <i>N</i>) when <i>a</i>(<i>x</i>) = Ω(<i>x</i>), <i>T</i>(<i>N</i>) = Ω(<i>N</i> log log <i>N</i>) when <i>a</i>(<i>x</i>) = Ω(<i>x</i><sup>β</sup>) with 0 < β < 1, and <i>T</i>(<i>N</i>) = Ω(<i>N</i> log* <i>N</i>) when <i>a</i>(<i>x</i>) = Ω(log <i>x</i>).
The stated lower bounds show that the pipelined decomposition-tree strategy is existentially optimal in the latter two cases, but indicate the potential for a modest <i>O</i>(log log <i>N</i>)-factor improvement for linear access functions. To realize this potential, we propose a <i>superpipelined</i> decomposition-tree memory manager, which achieves <i>T</i>(<i>N</i>) = <i>O</i>(<i>N</i> log <i>N</i>/log log <i>N</i>).
The pipelined decomposition-tree strategy can also be tuned to the computation, in order to exploit its <i>temporal locality</i> as characterized by the width parameters [9]. When these parameters are suitably bounded, <i>T</i>(<i>N</i>) = <i>O</i>(<i>N</i>) on any PH-RAM with a linear or sublinear access function. Finally, we discuss how performance could benefit from <i>parallelism</i> in the data-dependence dag of the computation or from architectural enhancements, such as <i>block-transfer</i> primitives, and we formulate various questions that deserve further investigation.
[1] Steven A. Przybylski, et al., Cache and memory hierarchy design: a performance-directed approach, 1990.
[2] Charles E. Leiserson, et al., Cache-Oblivious Algorithms, CIAC, 2003.
[3] Stephen A. Cook, et al., Time-bounded random access machines, J. Comput. Syst. Sci., 1972.
[4] John E. Savage, et al., Models of computation: exploring the power of computing, 1998.
[5] V. Milutinovic, et al., Enhancing and Exploiting the Locality, IEEE Trans. Computers, 1999.
[6] Nancy M. Amato, et al., Predicting performance on SMPs. A case study: the SGI Power Challenge, Proceedings 14th International Parallel and Distributed Processing Symposium (IPDPS 2000), 2000.
[7] Gianfranco Bilardi, et al., A Characterization of Temporal Locality and Its Portability across Memory Hierarchies, ICALP, 2001.
[8] Alok Aggarwal, et al., Hierarchical memory with block transfer, 28th Annual Symposium on Foundations of Computer Science (FOCS 1987), 1987.
[9] Michael Wolfe, et al., High performance compilers for parallel computing, 1995.
[10] F. P. Preparata, et al., Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part I, Upper Bounds, Theory of Computing Systems, 1995.
[11] Gianfranco Bilardi, et al., An approach towards an analytical characterization of locality and its portability, Innovative Architecture for Future Generation High-Performance Processors and Systems, 2001.
[12] Irving L. Traiger, et al., Evaluation Techniques for Storage Hierarchies, IBM Syst. J., 1970.
[13] Franco P. Preparata, et al., Horizons of Parallel Computation, J. Parallel Distributed Comput., 1992.
[14] Andrea Pietracaprina, et al., On the Space and Access Complexity of Computation DAGs, WG, 2000.
[15] Bowen Alpern, et al., A model for hierarchical memory, STOC, 1987.
[16] Jeffrey Scott Vitter, et al., External memory algorithms, ESA, 1998.
[17] John A. Fotheringham, et al., Dynamic storage allocation in the Atlas computer, including an automatic use of a backing store, Commun. ACM, 1961.