论文信息 - Optimizing off-chip accesses in multicores

Optimizing off-chip accesses in multicores

In a network-on-chip (NoC) based manycore architecture, an off-chip data access (main memory access) needs to travel through the on-chip network, spending considerable amount of time within the chip (in addition to the memory access latency). In addition, it contends with on-chip (cache) accesses as both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy, which places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the memory controller that handles this access. This brings three main benefits. First, the network latency of off-chip accesses gets reduced; second, the network latency of on-chip accesses gets reduced; and finally, the memory latency of off-chip accesses improves, due to reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The results collected emphasize the importance of optimizing the off-chip data accesses.

[1] Natalie D. Enright Jerger,et al. Achieving predictable performance through better memory controller placement in many-core CMPs , 2009, ISCA '09.

[2] Uday Bondhugula,et al. Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[3] Anoop Gupta,et al. Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[4] Mor Harchol-Balter,et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[5] Michael F. P. O'Boyle,et al. Non-singular data transformations: definition, validity and applications , 1997, ICS '97.

[6] Mainak Chaudhuri. PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[7] David A. Padua,et al. Compiler Techniques for the Distribution of Data and Computation , 2003, IEEE Trans. Parallel Distributed Syst..

[8] Reetuparna Das,et al. Application-to-core mapping policies to reduce memory system interference in multi-core systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[9] Alexander Schrijver,et al. Theory of linear and integer programming , 1986, Wiley-Interscience series in discrete mathematics and optimization.

[10] Vivien Quéma,et al. Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.

[11] Hannu Tenhunen,et al. Optimal memory controller placement for chip multiprocessor , 2011, 2011 Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[12] Frank Mueller,et al. Feedback-directed page placement for ccNUMA via hardware-generated memory traces , 2010, J. Parallel Distributed Comput..

[13] Rudolf Eigenmann,et al. SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance , 2001, WOMPAT.

[14] A. Snavely,et al. Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[15] Chau-Wen Tseng,et al. Data transformations for eliminating conflict misses , 1998, PLDI.

[16] John Zahorjan,et al. Optimizing Data Locality by Array Restructuring , 1995 .

[17] M. Franz,et al. Splitting Data Objects to Increase Cache Utilization ( Preliminary Version , 9 th October 1998 ) , 1998 .

[18] Sangyeun Cho,et al. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[19] Kei Hiraki,et al. Unified memory optimizing architecture: memory subsystem control with a unified predictor , 2012, ICS '12.

[20] Luca Benini,et al. Networks on chips - technology and tools , 2006, The Morgan Kaufmann series in systems on silicon.

[21] Mahmut T. Kandemir,et al. Neighborhood-aware data locality optimization for NoC-based multicores , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[22] Wei Ding,et al. Optimizing Off-Chip Accesses in Manycores , 2015 .

[23] Thomas R. Gross,et al. Matching memory access patterns and data placement for NUMA systems , 2012, CGO '12.

[24] Todd C. Mowry,et al. Compiler-directed page coloring for multiprocessors , 1996, ASPLOS VII.

[25] Alberto Ros,et al. Distance-aware round-robin mapping for large NUCA caches , 2009, 2009 International Conference on High Performance Computing (HiPC).

[26] Mor Harchol-Balter,et al. ATLAS : A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers , 2010 .

[27] Hyunjin Lee,et al. A flexible data to L2 cache mapping approach for future multicore processors , 2006, MSPC '06.

[28] David A. Wood,et al. Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[29] Monica S. Lam,et al. Data and computation transformations for multiprocessors , 1995, PPOPP '95.

[30] Ryan N. Rakvic,et al. Replacement techniques for dynamic NUCA cache designs on CMPs , 2013, The Journal of Supercomputing.