HALO: A Hierarchical Memory Access Locality Modeling Technique For Memory System Explorations

The growing complexity of applications poses new challenges for memory system design because of their data-intensive nature, complex access patterns, larger footprints, and similar factors. Fast and accurate microarchitectural exploration of future memory hierarchies is further hindered by the slow speed of full-system simulators, the difficulty of running the deep software stacks of many emerging workloads on simulators, and the proprietary nature of software. One technique to mitigate this problem is to create spatio-temporal models of access streams and use them to explore memory system tradeoffs. However, existing memory stream models have weaknesses: they either model only temporal locality, or they model spatio-temporal locality using global stride transitions, which results in high storage/metadata overhead. In this paper, we propose HALO, a Hierarchical memory Access LOcality modeling technique that identifies patterns by isolating global memory references into localized streams and then zooming into each local stream to capture multi-granularity spatial locality patterns. HALO also models the degree of interleaving between localized stream accesses by leveraging coarse-grained reuse locality. We evaluate HALO's effectiveness in replicating original application performance using over 20K different memory system configurations and show that HALO achieves over 98.3%, 95.6%, 99.3% and 96% accuracy in replicating the performance of prefetcher-enabled L1 and L2 caches, the TLB, and DRAM, respectively. HALO outperforms the state-of-the-art memory cloning schemes, WEST and STM, while using ~39X less metadata storage than STM.
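The idea of splitting a global reference stream into localized streams can be illustrated with a minimal sketch. This is not the HALO implementation: the 4 KiB region size, the function name `profile`, and the use of a simple LRU stack depth as the coarse-grained reuse measure are all assumptions chosen for illustration. The sketch groups addresses by region, records a stride histogram inside each region, and records how many other regions were touched between two accesses to the same region.

```python
from collections import Counter, defaultdict

REGION_BITS = 12  # assumption: 4 KiB regions define one "localized stream"


def profile(addresses, region_bits=REGION_BITS):
    """Summarize a global address trace as per-region stride histograms
    plus a histogram of region-level (coarse-grained) reuse depths."""
    last_addr = {}                         # region -> last address seen there
    local_strides = defaultdict(Counter)   # region -> intra-region stride counts
    region_reuse = Counter()               # LRU stack depth of region -> count
    lru = []                               # regions, most recently used first

    for addr in addresses:
        region = addr >> region_bits

        # Fine-grained spatial locality: stride relative to the previous
        # access in the same region, not the previous global access.
        if region in last_addr:
            local_strides[region][addr - last_addr[region]] += 1
        last_addr[region] = addr

        # Coarse-grained reuse: how many distinct other regions were
        # touched since this region was last accessed.
        if region in lru:
            depth = lru.index(region)
            region_reuse[depth] += 1
            lru.pop(depth)
        else:
            region_reuse["cold"] += 1
        lru.insert(0, region)

    return local_strides, region_reuse


# Two interleaved streams: stride 8 in one region, stride 64 in another.
trace = []
for i in range(4):
    trace.append(0x1000 + 8 * i)
    trace.append(0x8000 + 64 * i)

strides, reuse = profile(trace)
print(dict(strides[0x1]), dict(strides[0x8]), dict(reuse))
```

In this toy trace the global stride alternates between large positive and negative values, while each localized stream shows a single constant stride, which is why per-stream modeling can need far less metadata than a global stride transition table.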
