Iterational retiming with partitioning: Loop scheduling with complete memory latency hiding

The widening gap between processor and memory performance is the main obstacle to achieving high processor utilization in modern computer systems. To hide memory latency, a variety of techniques have been proposed, from intermediate fast memories (caches) to various prefetching and memory management techniques. In this article, we propose a new technique that combines loop scheduling with memory management, Iterational Retiming with Partitioning (IRP), which can completely hide memory latencies for applications with multidimensional loops on architectures such as the Cell processor. In IRP, the iteration space is first partitioned carefully. Then a two-part schedule, consisting of a processor part and a memory part, is produced such that the execution time of the memory part never exceeds the execution time of the processor part. The two parts are executed simultaneously, so memory latency is completely hidden. We prove that such an optimal two-part schedule can always be achieved given the right partition size and shape. Experiments on DSP benchmarks show that IRP consistently produces optimal solutions and significantly improves on previous techniques.
