The 800 Pound Python in the Machine Learning Room

Modern machine learning frameworks have one commonality: the primary interface, for better or worse, is Python. Python is widely appreciated for its low barrier to entry owing to its high-level built-ins and dynamic typing. However, these same features are also frequently blamed for the significant performance gap between the front-end in which users are asked to develop and the highly optimized back-end kernels that are ultimately invoked (generally written in a lower-level language such as C). This has led to frameworks like TensorFlow requiring programs that consist almost entirely of API calls, and that appear to be implemented in Python, the language, only coincidentally. Recent ML frameworks have recognized this gap between usability and performance as a problem and generally aim to bridge it in one of two ways. In the case of tools like PyTorch’s JIT compiler, executed tensor operations can be recorded via tracing based on operator overloading. In the case of tools like PyTorch’s Torch Script, Python functions can be marked for translation entirely to a low-level language. However, both tracing and wholesale translation in this fashion have significant downsides: tracing cannot capture data-dependent control flow, and wholesale translation forgoes opportunities for optimization via execution while a low-level IR is built up. In this paper, we demonstrate that these shortcomings can be overcome by performing a relatively simple source-to-source transformation that allows operator overloading techniques to be extended to language built-ins, including control flow operators, function definitions, etc. We use a pre-existing PLT Redex implementation of Python’s core grammar to provide assurances that our transformations are semantics-preserving with respect to standard Python. We then instantiate our overloading approach to generate code, which enables a form of multi-stage programming in Python. We capture the required transformations in a proof-of-concept, back-end-agnostic system dubbed Snek, and demonstrate their use in a production system released as part of TensorFlow, called AutoGraph. Finally, we provide an empirical evaluation of these systems and show performance benefits even with existing systems like TensorFlow, Torch Script, and Lantern as back-ends.
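To illustrate the kind of transformation described above, here is a minimal, self-contained sketch. It is not the actual Snek or AutoGraph implementation, and the names Sym, virtual_if, and to_expr are purely illustrative. It shows how rewriting an if statement into a call to an overloadable function lets a staged value intercept control flow the same way operator overloading intercepts arithmetic, while plain Python values continue to execute normally.

class Sym:
    """A staged value: operator overloading builds IR nodes instead of computing."""
    def __init__(self, expr):
        self.expr = expr
    def __add__(self, other):
        return Sym(("add", self.expr, to_expr(other)))
    def __gt__(self, other):
        return Sym(("gt", self.expr, to_expr(other)))

def to_expr(value):
    # Lift plain Python values into the IR; staged values pass through.
    return value.expr if isinstance(value, Sym) else ("const", value)

def virtual_if(cond, then_fn, else_fn):
    # The "overloaded if": a staged condition emits a conditional IR node,
    # while an ordinary Python boolean falls back to normal evaluation.
    if isinstance(cond, Sym):
        return Sym(("cond", cond.expr, to_expr(then_fn()), to_expr(else_fn())))
    return then_fn() if cond else else_fn()

# Original user code:
#
#     def f(x):
#         if x > 0:
#             y = x + 1
#         else:
#             y = x + 2
#         return y
#
# After the (hypothetical) source-to-source rewrite, the branches become
# functions whose results flow back through the overloadable call:

def f(x):
    return virtual_if(x > 0, lambda: x + 1, lambda: x + 2)

print(f(3))              # plain value: executes normally and prints 4
print(f(Sym("x")).expr)  # staged value: prints a conditional IR node instead

In a graph-building back-end such as TensorFlow, the emitted conditional node would correspond to a graph operation like tf.cond; the sketch above stops at a tuple-based IR purely for illustration.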
