Legate NumPy: accelerated and distributed array computing

NumPy is a popular Python library for array-based numerical computation. The canonical implementation used by most programmers runs on a single CPU core, with multi-core parallelism for only some operations. This restriction to single-node, CPU-only execution limits both the size of data that can be handled and the speed at which NumPy code can run. In this work we introduce Legate, a drop-in replacement for NumPy that requires only a single-line code change and scales to an arbitrary number of GPU-accelerated nodes. Legate translates NumPy programs to the Legion programming model and leverages the scalability of the Legion runtime system to distribute data and computation across an arbitrarily sized machine. Compared to equivalent programs written with the distributed Dask array library in Python, Legate achieves speed-ups of up to 10X on 1280 CPUs and 100X on 256 GPUs.
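To make the "single-line code change" claim concrete, the sketch below shows an ordinary NumPy stencil computation; under Legate, only the import line would differ. The `legate.numpy` module path is an assumption based on the project's description as a drop-in replacement, and the stencil kernel is an illustrative example, not code from the paper.

```python
# Stock NumPy today; to run under Legate, the only change would be
# swapping the import, e.g.:
#   import legate.numpy as np   # assumed module path, per Legate's drop-in design
import numpy as np

def stencil_step(grid):
    # One Jacobi-style relaxation step: each interior cell becomes the
    # average of its four neighbors. Expressed entirely with array slicing,
    # so a distributed NumPy implementation can partition the work.
    return 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                   grid[1:-1, :-2] + grid[1:-1, 2:])

grid = np.zeros((6, 6))
grid[0, :] = 1.0            # hot boundary on the top edge
inner = stencil_step(grid)
print(inner.shape)          # interior is (4, 4) for a 6x6 grid
```

Because the program is written against the NumPy API rather than an explicit message-passing model, the same source can run unchanged on a laptop or, with the swapped import, be distributed by the Legion runtime.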