Productivity, portability, performance: data-centric Python

Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python's high productivity while achieving portable performance across different architectures. The workflow's key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes.
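To make the idea of "annotated Python" concrete, below is a minimal sketch of what such a workflow can look like. It assumes a DaCe-style decorator interface (`@dace.program`, symbolic sizes via `dace.symbol`, NumPy-typed arguments); the paper's actual language extensions may differ in detail.

```python
# Minimal sketch (assumption: DaCe-style decorator interface).
import numpy as np
import dace

N = dace.symbol('N')  # symbolic size, inferred from the arguments at call time

@dace.program
def jacobi_step(A: dace.float64[N, N], B: dace.float64[N, N]):
    # Plain NumPy slicing; the framework lowers this to a data-centric IR
    # and generates optimized code for the target architecture.
    B[1:-1, 1:-1] = 0.25 * (A[1:-1, :-2] + A[1:-1, 2:] +
                            A[:-2, 1:-1] + A[2:, 1:-1])

A = np.random.rand(1024, 1024)
B = np.zeros_like(A)
jacobi_step(A, B)  # JIT-compiles on first call, then runs the generated code
```

The key point is that the source stays ordinary NumPy: the decorator and type annotations are the only additions, and the data-centric intermediate representation takes over from there.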
