Intrepydd: performance, productivity, and portability for data science application kernels

Major simultaneous disruptions are currently under way in both hardware and software. In hardware, "extreme heterogeneity" has become critical to sustaining cost and performance improvements after Moore's Law, but poses productivity and portability challenges for developers. In software, the rise of large-scale data science is driven by developers who come from diverse backgrounds and, moreover, who demand the rapid prototyping and interactive-notebook capabilities of high-productivity languages like Python. We introduce the Intrepydd programming system, which enables data scientists to write application kernels with high performance, productivity, and portability on current and future hardware. Intrepydd is based on Python, though the approach can be applied to other base languages as well. To deliver high performance, the Intrepydd toolchain uses ahead-of-time (AOT) compilation and high-level compiler optimizations of Intrepydd kernels. Intrepydd achieves portability through its ability to compile kernels for execution on different hardware platforms and for invocation from Python or C++ main programs. An empirical evaluation shows significant performance improvements relative to Python, as well as the suitability of Intrepydd for mapping onto post-Moore accelerators and architectures with relative ease. We believe that Intrepydd represents a new direction of "Discipline-Aware Languages" (DiALs), which brings us closer to the holy grail of obtaining productivity and portability with higher performance than current Python-like languages, and with more generality than current domain-specific languages and libraries.
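To make the workflow in the abstract concrete, the sketch below shows the kind of kernel Intrepydd targets: a numerical function written in a statically typable subset of Python, suitable for AOT compilation and invocation from a Python main program. This is an illustrative sketch only; the function name `saxpy`, the use of plain NumPy annotations, and the direct call at the bottom are assumptions for illustration, and the actual Intrepydd kernel syntax and toolchain entry points may differ.

```python
# Illustrative sketch (not Intrepydd's actual syntax): a kernel written in a
# statically typable Python subset, the style an AOT compiler can lower to
# native code. Annotations here use plain Python/NumPy types as a stand-in.
import numpy as np

def saxpy(a: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Compute a*x + y elementwise; the explicit loop keeps it AOT-friendly."""
    out = np.empty_like(y)
    for i in range(x.shape[0]):
        out[i] = a * x[i] + y[i]
    return out

# A Python main program would import and call the compiled kernel; here we
# simply call the pure-Python version to show the intended usage pattern.
x = np.arange(4, dtype=np.float64)
y = np.ones(4, dtype=np.float64)
print(saxpy(2.0, x, y))  # [1. 3. 5. 7.]
```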
