Riposte: A trace-driven compiler and parallel VM for vector code in R

There is a growing utilization gap between modern hardware and modern programming languages for data analysis. Due to power and other constraints, recent processor designs have sought improved performance through increased SIMD and multi-core parallelism. At the same time, high-level, dynamically typed languages for data analysis have become popular. These languages emphasize ease of use and high productivity but generally offer low performance and limited support for exploiting hardware parallelism. In this paper, we describe Riposte, a new runtime for the R language that bridges this gap. Riposte uses tracing, a technique commonly used to accelerate scalar code, to dynamically discover and extract sequences of vector operations from arbitrary R code. Once traces are extracted, we can fuse them to eliminate unnecessary memory traffic, compile them to use hardware SIMD units, and schedule them to run across multiple cores, allowing us to fully utilize the available parallelism on modern shared-memory machines. Our evaluation shows that Riposte can run vector R code near the speed of hand-optimized C and 5–50× faster than the open source implementation of R, and that it scales linearly to 32 cores for some tasks. Across 12 different workloads we achieve an overall average speed-up of over 150× without explicit programmer parallelization.
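
To illustrate why fusing vector operations removes memory traffic, the following minimal Python sketch compares two ways of evaluating an expression such as sum(a * b + c). It is not Riposte's implementation; the function names `unfused` and `fused` and the sample vectors are assumptions made only for illustration. The unfused version mimics an interpreter that materializes a full-length temporary for every vector operation, while the fused version performs the same arithmetic in a single pass with no intermediate vectors.

```python
def unfused(a, b, c):
    # Operation-at-a-time evaluation: each step allocates a full-length
    # temporary vector that is written to memory and then read back.
    t1 = [x * y for x, y in zip(a, b)]    # temporary for a * b
    t2 = [x + y for x, y in zip(t1, c)]   # temporary for (a * b) + c
    return sum(t2)


def fused(a, b, c):
    # Fused evaluation: one loop over the inputs, no temporaries.
    total = 0.0
    for x, y, z in zip(a, b, c):
        total += x * y + z
    return total


# Hypothetical sample data, chosen only to show both versions agree.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [0.5, 0.5, 0.5]

print(unfused(a, b, c), fused(a, b, c))
```

Both functions compute the same value; the fused loop simply touches each input element once, which is also the form that maps naturally onto SIMD lanes and onto independent chunks for multiple cores.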
