Paper information - Neighborhood-aware data locality optimization for NoC-based multicores

Data locality optimization is a critical issue for NoC (network-on-chip) based multicore systems. In this paper, focusing on a two-dimensional NoC-based multicore and data-intensive multithreaded applications, we first discuss a data-locality-aware scheduling algorithm for any given computation-to-core mapping, and then propose an integrated mapping+scheduling algorithm that performs both tasks together. Both algorithms consider temporal (time-wise) and spatial (neighborhood-aware) data reuse, and try to minimize distance-to-data in on-chip cache accesses. We test the effectiveness of our compiler algorithms using a set of twelve application programs. Our experiments indicate that the proposed algorithms achieve significant improvements in data access latencies (42.7% on average) and overall execution times (24.1% on average). We also conduct a sensitivity analysis in which we vary the number of cores, on-chip cache capacities, and data movement (migration) strategies. These experiments show that our proposed algorithms generate consistently good results.
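To make the notion of distance-to-data concrete, here is a minimal sketch. It is not the algorithm from the paper, only an illustration under stated assumptions: the cores form a 2D mesh, each core holds one on-chip cache bank, distance is the Manhattan hop count between mesh nodes, and a simple greedy pass places each computation block on the free core with the lowest access-weighted hop count to the banks holding its data. All names (`hops`, `cost`, `greedy_map`, and the layout of `blocks`) are hypothetical.

```python
from typing import Dict, Tuple

Core = Tuple[int, int]  # (row, col) position of a core/cache bank on the mesh


def hops(a: Core, b: Core) -> int:
    """Manhattan distance between two mesh nodes (assumed hop count)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost(core: Core, accesses: Dict[Core, int]) -> int:
    """Access-weighted distance-to-data if a block runs on `core`.

    `accesses` maps each cache bank to how often the block touches it.
    """
    return sum(count * hops(core, bank) for bank, count in accesses.items())


def greedy_map(
    blocks: Dict[str, Dict[Core, int]], rows: int, cols: int
) -> Dict[str, Core]:
    """Assign each block to a distinct core, heaviest block first.

    This is an illustrative heuristic (assumption), not an optimal mapping.
    """
    free = [(r, c) for r in range(rows) for c in range(cols)]
    if len(blocks) > len(free):
        raise ValueError("more computation blocks than cores")

    mapping: Dict[str, Core] = {}
    # Place blocks with the most cache accesses first, since they
    # contribute the most to total distance-to-data.
    order = sorted(blocks, key=lambda name: -sum(blocks[name].values()))
    for name in order:
        best = min(free, key=lambda core: cost(core, blocks[name]))
        mapping[name] = best
        free.remove(best)
    return mapping


if __name__ == "__main__":
    # Hypothetical workload: block "A" mostly reads the bank at (0, 0).
    example = {"A": {(0, 0): 10, (0, 1): 2}, "B": {(0, 0): 8}}
    print(greedy_map(example, rows=2, cols=2))
```

In this toy setting the heaviest block is placed on the core co-located with most of its data, and the next block takes the nearest remaining neighbor, which is the intuition behind neighborhood-aware placement. A scheduler that also accounts for temporal reuse would additionally order the blocks on each core over time, which this sketch does not attempt.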
