Minimization of memory access overhead for multidimensional DSP applications via multilevel partitioning and scheduling

Massive uniform nested loops are widely used in multidimensional digital signal processing (DSP) applications. Because such applications handle large amounts of data, optimizing data accesses by fully utilizing local memory and minimizing communication overhead is important for overall system performance. Most traditional partitioning strategies do not consider the effect of data access on computational performance. This paper proposes a multilevel partitioning method, based on a static data scheduling technique known as carrot-hole data scheduling, to control the data traffic between different levels of memory. Based on this data schedule, the optimal partition vector, scheduling vector, and partition size are chosen so as to minimize communication overhead. The partition scheme yields partitions of nonhomogeneous size, which produce a significant performance improvement. Experiments show that this technique significantly reduces local memory misses compared with traditional methods. The method can be used in application-specific DSP system design and in compilers for DSP processors.
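
To make the link between partition shape and communication overhead concrete, the following is a minimal sketch. It is not the carrot-hole data scheduling method itself. It assumes a two-dimensional uniform nested loop with constant dependence vectors, rectangular partitions of fixed size, and that every dependence whose source lies in a different partition costs one non-local access. The function name `count_remote_accesses` and all of its parameters are hypothetical, introduced only for illustration.

```python
from itertools import product


def count_remote_accesses(n_rows, n_cols, tile_rows, tile_cols, deps):
    """Count dependence edges whose source and sink fall in different tiles.

    Assumption (not from the paper): iteration (i, j) reads the value
    produced at (i - di, j - dj) for every dependence vector (di, dj),
    and a read is "remote" when the producer lies in another tile.
    """
    remote = 0
    for i, j in product(range(n_rows), range(n_cols)):
        for di, dj in deps:
            src_i, src_j = i - di, j - dj
            # Sources outside the iteration space are boundary inputs; skip them.
            if not (0 <= src_i < n_rows and 0 <= src_j < n_cols):
                continue
            src_tile = (src_i // tile_rows, src_j // tile_cols)
            dst_tile = (i // tile_rows, j // tile_cols)
            if src_tile != dst_tile:
                remote += 1
    return remote


if __name__ == "__main__":
    deps = [(1, 0), (0, 1)]  # hypothetical uniform dependence vectors
    for shape in [(4, 4), (2, 8), (8, 2)]:
        crossings = count_remote_accesses(8, 8, shape[0], shape[1], deps)
        print(f"tile {shape[0]}x{shape[1]}: {crossings} remote accesses")
```

Under these assumptions, partitions of equal area can differ in the number of non-local accesses they incur, and a partition aligned with the dependence vectors can avoid them altogether. This is the kind of effect that choosing the partition vector and partition size is meant to exploit.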
