Optimizing indirect memory references with milk

Modern applications such as graph and data analytics, when operating on real world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is increasing much slower than per CPU core count, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw a small fraction of much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap. In this paper, we introduce milk - a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP. We evaluate the Milk compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3×.

[1]  Uzi Vishkin,et al.  An O(log n) Parallel Connectivity Algorithm , 1982, J. Algorithms.

[2]  Harry Berryman,et al.  Run-Time Scheduling and Execution of Loops on Message Passing Machines , 1990, J. Parallel Distributed Comput..

[3]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[4]  Jeffrey F. Naughton,et al.  Cache Conscious Algorithms for Relational Query Processing , 1994, VLDB.

[5]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[6]  Chau-Wen Tseng,et al.  Improving Compiler and Run-Time Support for Irregular Reductions Using Local Writes , 1998, LCPC.

[7]  Ken Kennedy,et al.  Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999, PLDI '99.

[8]  Larry Carter,et al.  Localizing non-affine array references , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[9]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[10]  L. Amaral,et al.  The web of human sexual contacts , 2001, Nature.

[11]  Martin L. Kersten,et al.  Optimizing Main-Memory Join on Modern Hardware , 2002, IEEE Trans. Knowl. Data Eng..

[12]  Peter Sanders,et al.  [Delta]-stepping: a parallelizable shortest path algorithm , 2003, J. Algorithms.

[13]  Ulrich Meyer,et al.  [Delta]-stepping: a parallelizable shortest path algorithm , 2003, J. Algorithms.

[14]  Christos Faloutsos,et al.  R-MAT: A Recursive Model for Graph Mining , 2004, SDM.

[15]  Mark Newman,et al.  Detecting community structure in networks , 2004 .

[16]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[17]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[18]  Sherry Marcus,et al.  Graph-based technologies for intelligence analysis , 2004, CACM.

[19]  David A. Bader,et al.  On the architectural requirements for efficient execution of graph algorithms , 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[20]  Jimmy Su,et al.  Automatic support for irregular computations in a high-level language , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[21]  Chau-Wen Tseng,et al.  Exploiting locality for irregular scientific codes , 2006, IEEE Transactions on Parallel and Distributed Systems.

[22]  Yogish Sabharwal,et al.  Optimizing the HPCC randomaccess benchmark on blue Gene/L Supercomputer , 2006, SIGMETRICS '06/Performance '06.

[23]  Pradeep Dubey,et al.  Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs , 2009, Proc. VLDB Endow..

[24]  Keshav Pingali,et al.  The tao of parallelism in algorithms , 2011, PLDI '11.

[25]  Peter Sanders,et al.  Engineering a Multi-core Radix Sort , 2011, Euro-Par.

[26]  Alfons Kemper,et al.  Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems , 2012, Proc. VLDB Endow..

[27]  Carlos Guestrin,et al.  Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[28]  David A. Patterson,et al.  Direction-optimizing Breadth-First Search , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[29]  Guy E. Blelloch,et al.  Internally deterministic parallel algorithms can be fast , 2012, PPoPP '12.

[30]  Marco Rosa,et al.  Four degrees of separation , 2011, WebSci '12.

[31]  Gustavo Alonso,et al.  Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware , 2012, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[32]  Guy E. Blelloch,et al.  Ligra: a lightweight graph processing framework for shared memory , 2013, PPoPP '13.

[33]  Keshav Pingali,et al.  A lightweight infrastructure for graph analytics , 2013, SOSP.

[34]  Mahmut T. Kandemir,et al.  Trading cache hit rate for memory performance , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[35]  Pradeep Dubey,et al.  Navigating the maze of graph analytics frameworks using massive graph datasets , 2014, SIGMOD Conference.

[36]  Jaejin Lee,et al.  Versatile and scalable parallel histogram construction , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[37]  Jeff Chamberlain,et al.  Ivy Bridge Server: A Converged Design , 2015, IEEE Micro.

[38]  David A. Patterson,et al.  Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server , 2015, 2015 IEEE International Symposium on Workload Characterization.

[39]  Avery Ching,et al.  One Trillion Edges: Graph Processing at Facebook-Scale , 2015, Proc. VLDB Endow..

[40]  Jacob Nelson,et al.  Latency-Tolerant Software Distributed Shared Memory , 2015, USENIX Annual Technical Conference.

[41]  Torsten Hoefler,et al.  Evaluating the Cost of Atomic Operations on Modern Architectures , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[42]  David A. Patterson,et al.  The GAP Benchmark Suite , 2015, ArXiv.

[43]  Gustavo Alonso,et al.  Main-Memory Hash Joins on Modern Processor Architectures , 2015, IEEE Transactions on Knowledge and Data Engineering.