Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding

This paper presents an iteration space partitioning scheme that reduces the CPU idle time caused by long memory access latency. We consider accesses to both intermediate and initial data. One algorithm finds the largest overlap for initial data in order to reduce overall memory traffic. A second algorithm balances the ALU and memory schedules so that memory latency is hidden efficiently. Experiments on DSP benchmarks show that the algorithms significantly outperform existing methods.
