Understanding Object-level Memory Access Patterns Across the Spectrum

Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts would benefit from an in-depth understanding of memory access behavior, which existing access tracing and profiling methods do not provide. In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We have developed a two-pass tool that combines fast online profiling with slower offline profiling. With it we have profiled, at the variable/object level, a collection of 38 representative applications spanning major domains (HPC, personal computing, data analytics, AI, graph processing, and datacenter workloads), at varying problem sizes. We have performed detailed result analysis and code examination. Our findings provide new insights into application memory behavior, including per-object access patterns, the adoption of data structures, and how memory accesses change with problem size. We find that scientific computation applications behave distinctly from datacenter workloads, which motivates separate memory system designs and optimizations.
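To make "variable/object level" concrete, the following is a minimal sketch of how sampled memory addresses might be attributed to named objects. It is not the paper's tool: the function name `attribute_accesses`, the `(name, base, size)` object records, and the `<unattributed>` bucket are all assumptions made for illustration, and objects are assumed not to overlap.

```python
from bisect import bisect_right
from collections import Counter


def attribute_accesses(objects, addresses):
    """Count accesses per object (hypothetical helper, for illustration only).

    objects:   iterable of (name, base_address, size_in_bytes), non-overlapping
    addresses: iterable of accessed addresses (ints)
    """
    # Sort objects by base address so each lookup is a binary search.
    ranges = sorted((base, base + size, name) for name, base, size in objects)
    starts = [r[0] for r in ranges]

    counts = Counter()
    for addr in addresses:
        # Find the last object whose base address is <= addr.
        i = bisect_right(starts, addr) - 1
        if i >= 0 and addr < ranges[i][1]:
            counts[ranges[i][2]] += 1
        else:
            # Address falls outside every known object.
            counts["<unattributed>"] += 1
    return counts


# Made-up object layout and access samples.
objects = [("grid", 0x1000, 0x100), ("index", 0x2000, 0x40)]
samples = [0x1000, 0x10FF, 0x1100, 0x2008, 0x0FFF]
result = attribute_accesses(objects, samples)
print(dict(result))
```

In this sketch the object table would come from recorded allocations and the samples from an access trace; both are supplied here as made-up values.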
