Optimizing off-chip accesses in multicores

In a network-on-chip (NoC) based manycore architecture, an off-chip data access (main memory access) needs to travel through the on-chip network, spending considerable amount of time within the chip (in addition to the memory access latency). In addition, it contends with on-chip (cache) accesses as both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy, which places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the memory controller that handles this access. This brings three main benefits. First, the network latency of off-chip accesses gets reduced; second, the network latency of on-chip accesses gets reduced; and finally, the memory latency of off-chip accesses improves, due to reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The results collected emphasize the importance of optimizing the off-chip data accesses.

[1]  Natalie D. Enright Jerger,et al.  Achieving predictable performance through better memory controller placement in many-core CMPs , 2009, ISCA '09.

[2]  Uday Bondhugula,et al.  Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[3]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[4]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[5]  Michael F. P. O'Boyle,et al.  Non-singular data transformations: definition, validity and applications , 1997, ICS '97.

[6]  Mainak Chaudhuri PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[7]  David A. Padua,et al.  Compiler Techniques for the Distribution of Data and Computation , 2003, IEEE Trans. Parallel Distributed Syst..

[8]  Reetuparna Das,et al.  Application-to-core mapping policies to reduce memory system interference in multi-core systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[9]  Alexander Schrijver,et al.  Theory of linear and integer programming , 1986, Wiley-Interscience series in discrete mathematics and optimization.

[10]  Vivien Quéma,et al.  Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.

[11]  Hannu Tenhunen,et al.  Optimal memory controller placement for chip multiprocessor , 2011, 2011 Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[12]  Frank Mueller,et al.  Feedback-directed page placement for ccNUMA via hardware-generated memory traces , 2010, J. Parallel Distributed Comput..

[13]  Rudolf Eigenmann,et al.  SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance , 2001, WOMPAT.

[14]  A. Snavely,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[15]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[16]  John Zahorjan,et al.  Optimizing Data Locality by Array Restructuring , 1995 .

[17]  M. Franz,et al.  Splitting Data Objects to Increase Cache Utilization ( Preliminary Version , 9 th October 1998 ) , 1998 .

[18]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[19]  Kei Hiraki,et al.  Unified memory optimizing architecture: memory subsystem control with a unified predictor , 2012, ICS '12.

[20]  Luca Benini,et al.  Networks on chips - technology and tools , 2006, The Morgan Kaufmann series in systems on silicon.

[21]  Mahmut T. Kandemir,et al.  Neighborhood-aware data locality optimization for NoC-based multicores , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[22]  Wei Ding,et al.  Optimizing Off-Chip Accesses in Manycores , 2015 .

[23]  Thomas R. Gross,et al.  Matching memory access patterns and data placement for NUMA systems , 2012, CGO '12.

[24]  Todd C. Mowry,et al.  Compiler-directed page coloring for multiprocessors , 1996, ASPLOS VII.

[25]  Alberto Ros,et al.  Distance-aware round-robin mapping for large NUCA caches , 2009, 2009 International Conference on High Performance Computing (HiPC).

[26]  Mor Harchol-Balter,et al.  ATLAS : A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers , 2010 .

[27]  Hyunjin Lee,et al.  A flexible data to L2 cache mapping approach for future multicore processors , 2006, MSPC '06.

[28]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[29]  Monica S. Lam,et al.  Data and computation transformations for multiprocessors , 1995, PPOPP '95.

[30]  Ryan N. Rakvic,et al.  Replacement techniques for dynamic NUCA cache designs on CMPs , 2013, The Journal of Supercomputing.