Locality management using multiple SPMs on the Multi-Level Computing Architecture

The Multi-Level Computing Architecture (MLCA) is a novel system-on-chip architecture for embedded systems, designed to exploit task-level and instruction-level parallelism in multimedia applications. The MLCA provides a unique two-level programming model that simplifies the development of embedded applications. To cope with increasing intra-system communication delays, we introduce a distributed-memory version of the MLCA in which separate storage is used for global and local application data. Global data is stored in multiple on-chip scratch-pad memories (SPMs) with non-uniform memory access (NUMA) latencies, while local data is stored in PU-private memories. In such designs, a key factor affecting application performance is the locality of access to global data. We introduce programming constructs and run-time support to dynamically manage the data stored in the SPMs and to influence run-time task scheduling. Collectively, our techniques improve performance by 6% to 40% compared to simple static memory management and scheduling approaches.
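As an illustration of why locality of access to global data matters under NUMA latencies, the following minimal sketch picks, among idle PUs, the one with the lowest estimated cost of reaching a task's global data. It is not the MLCA run-time: the latency table, the function names `access_cost` and `pick_pu`, and the per-task access counts are all hypothetical and chosen only for this example.

```python
from typing import Dict, List

# Hypothetical NUMA latency table: LATENCY[pu][spm] is the cost (in cycles)
# for processing unit `pu` to access scratch-pad memory `spm`.
# These numbers are illustrative and are not taken from the paper.
LATENCY: List[List[int]] = [
    [2, 6, 10],   # PU 0 is closest to SPM 0
    [6, 2, 6],    # PU 1 is closest to SPM 1
    [10, 6, 2],   # PU 2 is closest to SPM 2
]


def access_cost(pu: int, accesses: Dict[int, int]) -> int:
    """Estimated total latency for `pu`, given a map of SPM index -> access count."""
    return sum(LATENCY[pu][spm] * count for spm, count in accesses.items())


def pick_pu(idle_pus: List[int], accesses: Dict[int, int]) -> int:
    """Choose the idle PU with the lowest estimated global-data access cost."""
    return min(idle_pus, key=lambda pu: access_cost(pu, accesses))


if __name__ == "__main__":
    # Assumed workload: a task that touches SPM 0 a few times and SPM 2 heavily.
    task_accesses = {0: 5, 2: 100}
    print(pick_pu([0, 1, 2], task_accesses))  # all PUs idle
    print(pick_pu([0, 1], task_accesses))     # PU 2 busy
```

With all PUs idle, the task is placed on the PU nearest the SPM it uses most; when that PU is busy, the next-cheapest PU is chosen. A static scheduler that ignores data placement would not make this distinction.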
