Data space-oriented tiling for enhancing locality

Improving the locality of data references is becoming increasingly important because of the growing gap between processor cycle times and off-chip memory access latencies. Better data locality not only shortens effective memory access time but also reduces the energy the memory system consumes in servicing data references. An optimizing compiler can play an important role in enhancing data locality in array-intensive embedded media applications with regular data access patterns.

This paper presents a compiler-based data space-oriented tiling approach (DST). In this strategy, the data space (e.g., an array of signals) is logically divided into chunks (called data tiles), and each data tile is processed in turn. To process a data tile, our approach traverses the entire iteration space of all nests in the code and executes all iterations (potentially coming from different nests) that access the data tile being processed. In doing so, it also takes data dependences into account. Since a data space is common to all nests that access it, DST can exploit inter-nest data locality and can therefore potentially achieve better results than traditional iteration space (loop) tiling.

We also present an example application of DST for improving the effectiveness of a scratch pad memory (SPM) for data accesses. SPMs are alternatives to conventional cache memories in the embedded computing world. Like caches, these small on-chip memories provide fast, low-power access to data, but they differ from conventional data caches in that their contents are managed by the compiler instead of the hardware. We have implemented DST in a source-to-source translator and quantified its benefits using a simulator. Our preliminary results with several array-intensive applications and varying input sizes show that our approach outperforms classical iteration space-oriented tiling as well as a data-oriented approach that considers each nest in isolation.
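As an illustration of the strategy described above, the following minimal sketch contrasts two loop nests run one after the other with a data tile-driven execution of the same iterations. Everything in it is an assumption made for illustration and is not taken from the paper's implementation: the arrays `a` and `b`, the function names `run_original` and `run_dst`, the `tile` parameter, and the choice of tiling only `a`. It also assumes the simplest dependence pattern, in which each iteration touches exactly one element of the tiled array, so keeping the nests in their original order within each tile is enough to respect the dependences.

```python
def run_original(n):
    """Two loop nests executed one after the other (the untiled code)."""
    a = list(range(n))
    b = [0] * n
    for i in range(n):              # nest 1 writes a[i]
        a[i] = a[i] + 1
    for j in range(n):              # nest 2 reads a[n - 1 - j]
        b[j] = 2 * a[n - 1 - j]
    return a, b


def run_dst(n, tile):
    """Data tile-driven execution of the same two nests (illustrative only)."""
    a = list(range(n))
    b = [0] * n
    # Logically divide the data space of `a` into tiles [lo, hi).
    for lo in range(0, n, tile):
        hi = min(lo + tile, n)
        # Scan the iteration space of nest 1 and run only the iterations
        # whose access to `a` falls inside the current data tile.
        for i in range(n):
            if lo <= i < hi:
                a[i] = a[i] + 1
        # Do the same for nest 2. Running it after nest 1 within the tile
        # preserves the dependence from the write of a[k] to its read.
        for j in range(n):
            if lo <= n - 1 - j < hi:
                b[j] = 2 * a[n - 1 - j]
    return a, b
```

The guarded scans mirror the description of traversing the entire iteration space for each tile. A compiler would presumably derive the loop bounds for each tile instead of testing every iteration, and codes in which one iteration touches several tiles need a more careful treatment of dependences than this sketch provides.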
