Cimple: instruction and memory level parallelism: a DSL for uncovering ILP and MLP

Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for in-flight memory requests. These resources, however, are often poorly utilized on workloads with large working sets, e.g., in-memory databases, key-value stores, and graph analytics, because compilers and hardware struggle to expose ILP and MLP from the instruction stream automatically. In this paper, we introduce the IMLP (Instruction and Memory Level Parallelism) task programming model. IMLP tasks execute as coroutines that yield execution at annotated long-latency operations, e.g., memory accesses, divisions, or unpredictable branches. IMLP tasks are interleaved on a single thread, and integrate well with thread parallelism and vectorization. Cimple, our DSL embedded in C++, allows exploration of task scheduling and transformations such as buffering, vectorization, pipelining, and prefetching. We demonstrate state-of-the-art performance on core algorithms used in in-memory databases that operate on arrays, hash tables, trees, and skip lists. Cimple applications reach 2.5× throughput gains over hardware multithreading on a multi-core system, and a 6.4× single-thread speedup.
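
To make the interleaving idea concrete, the following is a minimal hand-written sketch in plain C++; it does not use the Cimple DSL or its API. It assumes a sorted array and a batch of lookup keys, and keeps several binary searches in flight on one thread. Each search issues a prefetch for the element it will read next and then gives up its turn, which plays the role of a coroutine yield at a long-latency memory access. The names `SearchTask`, `interleaved_lower_bound`, `width`, and `PREFETCH` are hypothetical, and the prefetch relies on the GCC/Clang builtin `__builtin_prefetch` (it falls back to a no-op elsewhere).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Prefetch hint: GCC/Clang builtin where available, otherwise a no-op.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)(p))
#endif

// Hypothetical per-task state: one in-flight binary search.
struct SearchTask {
    std::uint32_t key;  // value being searched for
    std::size_t lo;     // inclusive lower bound of the remaining range
    std::size_t hi;     // exclusive upper bound of the remaining range
    std::size_t out;    // slot in the result vector for this lookup
};

// For each key, returns the index of the first element >= key.
// Up to `width` searches are interleaved on the calling thread.
std::vector<std::size_t> interleaved_lower_bound(
        const std::vector<std::uint32_t>& sorted,
        const std::vector<std::uint32_t>& keys,
        std::size_t width) {
    std::vector<std::size_t> result(keys.size(), 0);
    if (width == 0) width = 1;

    std::vector<SearchTask> active;
    active.reserve(width);
    std::size_t next = 0;  // next key that has not been started yet

    while (next < keys.size() || !active.empty()) {
        // Refill the set of active tasks up to the interleaving width.
        while (active.size() < width && next < keys.size()) {
            active.push_back({keys[next], 0, sorted.size(), next});
            ++next;
        }

        // Round-robin: every active task performs one search step.
        for (std::size_t i = 0; i < active.size();) {
            SearchTask& t = active[i];
            if (t.lo < t.hi) {
                // This load was prefetched when the task last "yielded".
                std::size_t mid = t.lo + (t.hi - t.lo) / 2;
                if (sorted[mid] < t.key) t.lo = mid + 1;
                else t.hi = mid;
            }
            if (t.lo >= t.hi) {
                // Search finished: record the result and retire the task.
                result[t.out] = t.lo;
                active[i] = active.back();
                active.pop_back();
                continue;  // slot i now holds a different task
            }
            // Request the next probe location, then move on to another
            // task instead of stalling on the cache miss ("yield").
            PREFETCH(&sorted[t.lo + (t.hi - t.lo) / 2]);
            ++i;
        }
    }
    return result;
}
```

Calling `interleaved_lower_bound(sorted, keys, 8)` would keep up to eight lookups in flight. The useful value of `width` depends on the machine's buffers for outstanding memory requests, so it is left as a tunable parameter here.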
