Python programmers have GPUs too: automatic Python loop parallelization with staged dependence analysis

Python is a popular language for end-user software development in many application domains. End-users want to harness parallel compute resources effectively by exploiting commodity manycore technology, including GPUs. However, existing approaches to parallelism in Python are esoteric and generally too complex for the typical end-user developer. We argue that implicit, or automatic, parallelization is the best way to deliver the benefits of manycore to end-users, since it avoids domain-specific languages, specialist libraries, complex annotations, and restrictive language subsets. Auto-parallelization fits the Python philosophy, provides effective performance, and is convenient for non-expert developers. We show that, despite being a dynamic language, Python is a suitable target for auto-parallelization. In an empirical study of 3000+ open-source Python notebooks, we demonstrate that typical loop behaviour ‘in the wild’ is amenable to auto-parallelization. We show that staging the dependence analysis is an effective way to maximize performance: we apply classical dependence analysis techniques, then leverage the Python runtime’s rich introspection capabilities to resolve the remaining loop bounds and variable types just-in-time. The parallel loop nests are then compiled to CUDA kernels for GPU execution. Across 12 loop-intensive standard benchmarks, we achieve orders-of-magnitude speedups over baseline interpreted execution and speedups of up to 50x (though not consistently) over CPU JIT-compiled execution.
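The two-stage pipeline the abstract describes can be illustrated with a small sketch. What follows is not the paper's implementation, only a minimal model under stated assumptions: array indices of the form i, i+c, or i-c are checked with a classical constant-distance (strong SIV) dependence test at analysis time, and any name the static stage cannot resolve, including symbolic loop bounds such as range(n), is deferred to a just-in-time stage that reads live values via frame introspection. The names analyse, resolve, saxpy, and prefix are illustrative only, and the AST layout assumes Python 3.9+.

```python
import ast
import inspect
import textwrap

def analyse(fn):
    """Stage 1 (static): a classical dependence test over the first
    `for` loop in fn. Returns (carried, deferred): whether a loop-carried
    dependence was found, and which names need runtime resolution."""
    src = textwrap.dedent(inspect.getsource(fn))
    loop = next(n for n in ast.walk(ast.parse(src)) if isinstance(n, ast.For))
    ivar = loop.target.id
    reads, writes, deferred = {}, {}, set()

    def offset(idx):
        # i -> 0; i + c -> +c; i - c -> -c; anything else is symbolic.
        if isinstance(idx, ast.Name) and idx.id == ivar:
            return 0
        if (isinstance(idx, ast.BinOp)
                and isinstance(idx.left, ast.Name) and idx.left.id == ivar
                and isinstance(idx.right, ast.Constant)):
            sign = 1 if isinstance(idx.op, ast.Add) else -1
            return sign * idx.right.value
        return None

    for node in ast.walk(loop):
        if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Name):
            off = offset(node.slice)  # Python 3.9+: slice is the expression
            if off is None:
                deferred.add(node.value.id)  # symbolic index: resolve JIT
            else:
                table = writes if isinstance(node.ctx, ast.Store) else reads
                table.setdefault(node.value.id, set()).add(off)

    # Symbolic loop bounds (e.g. range(n)) are also left to the JIT stage.
    if isinstance(loop.iter, ast.Call):
        for arg in loop.iter.args:
            if isinstance(arg, ast.Name):
                deferred.add(arg.id)

    # Strong-SIV test: a nonzero write-read distance on the same array
    # means a loop-carried dependence, so the loop is not parallel.
    carried = any(w != r
                  for arr, ws in writes.items()
                  for w in ws
                  for r in reads.get(arr, set()))
    return carried, deferred

def resolve(deferred):
    """Stage 2 (just-in-time): look the deferred names up in the live
    caller frame, the kind of introspection CPython readily provides."""
    frame = inspect.currentframe().f_back
    return {name: frame.f_locals.get(name, frame.f_globals.get(name))
            for name in deferred}

def saxpy(a, x, y, n):
    for i in range(n):
        y[i] = a * x[i] + y[i]   # all distances 0: no carried dependence

def prefix(x, n):
    for i in range(1, n):
        x[i] = x[i] + x[i - 1]   # write at i, read at i-1: carried

print(analyse(saxpy))   # (False, {'n'}): candidate for offload
print(analyse(prefix))  # (True, {'n'}): keep sequential
```

In a real system, this is the point at which a dependence-free loop, with its bounds and element types now concrete, could be specialized into a CUDA kernel and launched (e.g. through a runtime code-generation layer such as PyCUDA); the sketch stops at the analysis decision.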
