Data-driven spatial locality

Researchers and practitioners devote considerable effort to improving spatial locality in their programs. Hardware caches rely on spatial locality to operate efficiently; when it is absent, they waste memory bandwidth and cache space by fetching data that is evicted before it is ever used. Improving spatial locality is difficult: it is mostly done by hand by expert programmers and requires substantial insight into the program's data layout and data access patterns. This work introduces access graphs, a novel abstraction of memory access patterns that exposes spatial locality features and allows better memory layouts to be computed automatically. Using access graphs and a set of analysis algorithms and tools, we significantly improve simulated cache miss rates by changing data layout. Further, we use random forest classifiers to automatically identify features of the data that correlate with how the data is actually used. We build a memory allocator that uses these features to guide data allocation at runtime, achieving better spatial locality and improved performance as a result.
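To make the idea of deriving layout hints from an access pattern concrete, here is a minimal sketch in Python. It is not the construction used in this work, which the abstract does not spell out. It assumes that an access graph has one node per data object and weights each edge by how often the two objects are accessed close together in a trace. The names `build_access_graph`, `line_transitions`, `window`, and `objects_per_line` are hypothetical, and the transition count is only a crude stand-in for a simulated cache miss rate.

```python
from collections import Counter


def build_access_graph(trace, window=2):
    """Return edge weights of a co-access graph (assumed definition).

    An edge (x, y) is incremented each time objects x and y appear
    within `window` consecutive accesses of the trace.
    """
    edges = Counter()
    for i, obj in enumerate(trace):
        # Look back at the previous (window - 1) accesses.
        for other in trace[max(0, i - window + 1):i]:
            if other != obj:
                edges[tuple(sorted((obj, other)))] += 1
    return edges


def line_transitions(trace, layout, objects_per_line):
    """Count consecutive accesses that fall on different cache lines.

    `layout` lists objects in memory order; each cache line is assumed
    to hold `objects_per_line` equally sized objects.
    """
    line_of = {obj: idx // objects_per_line for idx, obj in enumerate(layout)}
    return sum(1 for a, b in zip(trace, trace[1:]) if line_of[a] != line_of[b])


# A toy trace in which "a" is used together with "c", and "b" with "d".
trace = ["a", "c", "a", "c", "b", "d", "b", "d"]
graph = build_access_graph(trace)

# Allocation order versus a layout that co-locates the heavy edges.
allocation_order = ["a", "b", "c", "d"]
graph_guided = ["a", "c", "b", "d"]
```

In this toy trace the heaviest edges connect `a` with `c` and `b` with `d`, so placing each pair on the same cache line removes most of the line-to-line transitions that the allocation-order layout incurs.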
