3D-DRAM Performance for Different OpenMP Scheduling Techniques in Multicore Systems

Advances in memory technologies, including 3D-DRAM memories (such as High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) systems) and wide I/O memory, promise very large bandwidths at lower power consumption to address the needs of high-performance computing as well as emerging big data applications. However, to fully benefit from such bandwidths, it is necessary to understand how to optimally organize data across the channels, ranks, banks, or vaults of the memory structures, how to obtain large volumes of data with fewer accesses, and how to schedule the threads of multithreaded applications to benefit from these memory organizations. In this paper, we examine different memory organizations that spread data across channels, ranks, and banks, and identify application features that benefit from each organization. Our study applies to generic DDR memory structures as well as 3D-DRAMs. We also evaluate the scheduling of OpenMP threads (e.g., using static, dynamic, and guided scheduling), with emphasis on how different scheduling methods benefit from different memory organizations. Our experiments show that, by using the best scheduling for the application together with a proper memory organization, we can achieve performance gains of up to 16 percent, depending on the workload.
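To make the interaction between memory organization and OpenMP scheduling concrete, the following is a minimal sketch in C, not the configuration used in the paper. The geometry (64-byte lines, 8 KiB rows, 4 channels, 2 ranks, 8 banks), the two address mappings, and all function names are assumptions chosen for illustration. One mapping interleaves consecutive cache lines across channels; the other keeps a whole row in one bank before moving on. The helper `channels_touched` counts how many channels a strided access stream reaches under a given mapping.

```c
#include <stdint.h>

/* Hypothetical geometry (an assumption, not from the paper):
   64-byte lines, 8 KiB rows, 4 channels, 2 ranks, 8 banks. */
enum { LINE_BITS = 6, ROW_BITS = 13,
       CHANNEL_BITS = 2, RANK_BITS = 1, BANK_BITS = 3 };

typedef struct { unsigned channel, rank, bank; } dram_loc;

/* Remove the lowest `bits` bits from *a and return them. */
static unsigned take_bits(uint64_t *a, unsigned bits) {
    unsigned v = (unsigned)(*a & ((1u << bits) - 1u));
    *a >>= bits;
    return v;
}

/* Fine-grained mapping: channel bits sit just above the line offset,
   so consecutive cache lines land on different channels. */
static dram_loc map_line_interleaved(uint64_t addr) {
    dram_loc loc;
    uint64_t a = addr >> LINE_BITS;
    loc.channel = take_bits(&a, CHANNEL_BITS);
    loc.bank    = take_bits(&a, BANK_BITS);
    loc.rank    = take_bits(&a, RANK_BITS);
    return loc;
}

/* Coarse-grained mapping: a full row stays in one bank; bank, rank,
   and channel bits come from above the row offset, in that order. */
static dram_loc map_row_contiguous(uint64_t addr) {
    dram_loc loc;
    uint64_t a = addr >> ROW_BITS;
    loc.bank    = take_bits(&a, BANK_BITS);
    loc.rank    = take_bits(&a, RANK_BITS);
    loc.channel = take_bits(&a, CHANNEL_BITS);
    return loc;
}

/* Count the distinct channels reached by `count` accesses starting at
   `base` and separated by `stride` bytes. The schedule clause decides
   which thread handles which iterations; swapping schedule(static) for
   schedule(dynamic) or schedule(guided) changes that assignment but
   not the result. Without OpenMP enabled the pragma is ignored. */
static unsigned channels_touched(uint64_t base, uint64_t stride, long count,
                                 dram_loc (*map)(uint64_t)) {
    unsigned mask = 0;
    #pragma omp parallel for schedule(static) reduction(|:mask)
    for (long i = 0; i < count; i++)
        mask |= 1u << map(base + (uint64_t)i * stride).channel;

    unsigned n = 0;
    for (; mask != 0; mask >>= 1)
        n += mask & 1u;
    return n;
}
```

Under these assumed parameters, 16 consecutive cache lines reach all four channels with the line-interleaved mapping but only one channel with the row-contiguous mapping. With a static schedule each thread receives a contiguous block of iterations, so the mapping largely determines whether threads share a channel or spread across several.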
