Paper Information - Data I/O Minimization for Loops on Limited Onchip Memory Processors


Due to significant advances in VLSI technology, ‘mega-processors’ made with a large number of transistors have become a reality. These processors typically provide multiple functional units, which allow exploitation of parallelism. To cater to the data demands associated with parallelism, the processors provide on-chip memory, but only a limited amount because of its higher area and power requirements. Even though limited, such on-chip memory is a very valuable resource in the memory hierarchy. An important use of on-chip memory is to hold the instructions of short loops along with the associated data for very fast computation. Such schemes are very attractive on embedded processors where, due to the presence of dedicated on-chip hardware (such as very fast multipliers-shifters, etc.) and extremely fast accesses to on-chip data, the computation time of such loops is extremely small, meeting almost all real-time demands. The biggest bottleneck to performance in these cases is off-chip accesses, and thus compilers must carefully analyze references to identify good candidates for promotion to on-chip memory. In our earlier work [6], we formulated this problem in terms of 0/1 knapsack and proposed a heuristic solution that gives us good promotion candidates. Our analysis was limited to a single loop nest. When we attempted to extend this framework to multiple loop nests (intra-procedurally), we realized that it is important not only to identify good candidates for promotion but also to restructure loops carefully before performing promotion, since the data I/O of loading and storing values to on-chip memory poses a significant bottleneck.
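
To make the 0/1 knapsack view of promotion concrete, here is a minimal sketch. It is not the heuristic from the authors' earlier work: it is the textbook exact dynamic-programming solution, used only for illustration. The function name `select_for_promotion`, the per-candidate `sizes` (on-chip words needed) and `benefits` (estimated off-chip accesses saved), and the sample numbers are all assumptions.

```python
from typing import List, Tuple


def select_for_promotion(
    sizes: List[int], benefits: List[int], capacity: int
) -> Tuple[int, List[int]]:
    """Pick candidates to promote to on-chip memory (illustrative only).

    sizes[i]    -- hypothetical on-chip words needed by candidate i
    benefits[i] -- hypothetical off-chip accesses saved by promoting i
    capacity    -- hypothetical on-chip memory size in words
    Returns (total benefit, sorted indices of the chosen candidates).
    """
    n = len(sizes)
    # best[i][c] = max benefit using the first i candidates within c words
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, gain = sizes[i - 1], benefits[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: leave candidate off-chip
            if size <= c:  # option 2: promote it, if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - size] + gain)

    # Walk the table backwards to recover which candidates were promoted.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= sizes[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


if __name__ == "__main__":
    # Four made-up candidates competing for 9 words of on-chip memory.
    print(select_for_promotion([3, 4, 2, 5], [30, 50, 15, 60], 9))
```

The exact table costs O(n × capacity) time and space, which is one reason a heuristic may be preferred in a compiler. The sketch also ignores the cost of loading and storing promoted values, which is the data I/O the abstract identifies as the bottleneck across multiple loop nests.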

Santosh Pande | Lei Wang

[1] Jack Dongarra, et al. Automatic Blocking of Nested Loops, 1990.

[2] Michael Wolfe, et al. Iteration Space Tiling for Memory Hierarchies, 1987, PPSC.

[3] Ken Kennedy, et al. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution, 1993, LCPC.

[4] V. Sarkar, et al. Collective Loop Fusion for Array Contraction, 1992, LCPC.

[5] Santosh Pande, et al. An Efficient Data Partitioning Method for Limited Memory Embedded Systems, 1998, LCTES.

[6] Keshav Pingali, et al. Data-centric multi-level blocking, 1997, PLDI '97.