Loop scheduling with complete memory latency hiding on multi-core architecture

The widening gap between processor and memory performance is the main obstacle to achieving high processor utilization in modern computer systems. In this paper, we propose a new technique that combines loop scheduling with memory management, iterational retiming with partitioning (IRP), which can completely hide memory latencies for applications with multi-dimensional loops on architectures such as the CELL processor (J.A. Kahle et al., 2005). In IRP, the iteration space is first partitioned carefully. Then a two-part schedule, consisting of a processor part and a memory part, is produced such that the execution time of the memory part never exceeds the execution time of the processor part. The two parts execute simultaneously, so memory latency is completely hidden. Experiments on DSP benchmarks show that IRP consistently produces optimal solutions and significantly improves on previous techniques.
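The hiding condition stated in the abstract (the memory part never takes longer than the processor part) can be illustrated with a minimal sketch. This is not the IRP algorithm itself: the rectangular partition, the cost model, and every name and parameter below (Partition, cycles_per_iteration, num_transfers, and so on) are hypothetical assumptions chosen only to make the condition concrete.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Partition:
    """Hypothetical rectangular partition of a 2-D iteration space."""
    width: int   # iterations along the inner loop dimension
    height: int  # iterations along the outer loop dimension

    @property
    def iterations(self) -> int:
        return self.width * self.height


def processor_part_time(p: Partition, cycles_per_iteration: int, num_cores: int) -> int:
    """Cycles the cores need to execute all iterations of one partition
    (assumed model: iterations are spread evenly over the cores)."""
    return ceil(p.iterations / num_cores) * cycles_per_iteration


def memory_part_time(num_transfers: int, transfer_latency: int, num_memory_units: int) -> int:
    """Cycles the memory units need to complete the transfers issued while
    one partition executes (assumed model: transfers run in equal batches)."""
    return ceil(num_transfers / num_memory_units) * transfer_latency


def latency_fully_hidden(proc_time: int, mem_time: int) -> bool:
    """Condition from the abstract: the memory part must not outlast the processor part."""
    return mem_time <= proc_time


if __name__ == "__main__":
    mem = memory_part_time(num_transfers=12, transfer_latency=10, num_memory_units=3)
    for part in (Partition(2, 4), Partition(4, 8)):
        proc = processor_part_time(part, cycles_per_iteration=5, num_cores=4)
        print(part, proc, mem, latency_fully_hidden(proc, mem))
```

Under these assumed numbers, the smaller partition gives the processor part too little work to cover the transfers, while the larger one satisfies the condition. In this toy model the transfer count is a fixed input; in a real schedule it would depend on the partition shape and the loop's data dependencies.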

[1] Kathryn S. McKinley, et al. Guided region prefetching: a cooperative hardware/software approach, 2003, ISCA '03.

[2] Manoj Franklin, et al. Control flow prediction with tree-like subgraphs for superscalar processors, 1995, MICRO 1995.

[3] Michel Dubois, et al. Sequential Hardware Prefetching in Shared-Memory Multiprocessors, 1995, IEEE Trans. Parallel Distributed Syst.

[4] Edwin Hsing-Mean Sha, et al. Loop scheduling and partitions for hiding memory latencies, 1999, Proceedings 12th International Symposium on System Synthesis.

[5] Edwin Hsing-Mean Sha, et al. Iterational retiming: maximize iteration-level parallelism for nested loops, 2005, 2005 Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'05).

[6] H. Peter Hofstee, et al. Introduction to the Cell multiprocessor, 2005, IBM J. Res. Dev.

[7] Charles E. Leiserson, et al. Retiming synchronous circuitry, 1988, Algorithmica.

[8] Edwin Hsing-Mean Sha, et al. Optimizing Overall Loop Schedules Using Prefetching and Partitioning, 2000, IEEE Trans. Parallel Distributed Syst.

[9] T. Ozawa, et al. Cache miss heuristics and preloading techniques for general-purpose programs, 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[10] Edwin Hsing-Mean Sha, et al. Scheduling and partitioning for multiple loop nests, 2001, International Symposium on System Synthesis (IEEE Cat. No.01EX526).

[11] Edwin Hsing-Mean Sha, et al. Scheduling of uniform multidimensional systems under resource constraints, 1998, IEEE Trans. Very Large Scale Integr. Syst.

[12] Edwin Hsing-Mean Sha, et al. Rotation Scheduling: A Loop Pipelining Algorithm, 1993, 30th ACM/IEEE Design Automation Conference.

[13] Jean-Loup Baer, et al. A performance study of software and hardware data prefetching schemes, 1994, ISCA '94.

[14] Seung Ryoul Maeng, et al. An adaptive sequential prefetching scheme in shared-memory multiprocessors, 1997, Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162).

[15] Edwin Hsing-Mean Sha, et al. Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding, 2002, EURASIP J. Adv. Signal Process.

[16] Edwin Hsing-Mean Sha, et al. Rotation scheduling: a loop pipelining algorithm, 1997, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.

[17] Mikko H. Lipasti, et al. Cache miss heuristics and preloading techniques for general-purpose programs, 1995, MICRO 28.