OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides an easy-to-use programming interface while allowing library developers to enhance the performance of their applications by harnessing the computing power offered by High Performance Computing (HPC) platforms. Efficient communication is key to scaling applications on parallel systems, and on HPC hardware it is typically enabled by the Message Passing Interface (MPI) standard and compliant libraries. mpi4py is a Python communication library that provides an MPI-like interface for Python applications, allowing application developers to utilize parallel processing elements, including GPUs. However, there is currently no benchmark suite to evaluate the communication performance of mpi4py (and of Python MPI codes in general) on modern HPC systems. To bridge this gap, we propose OMB-Py, Python extensions to the open-source OSU Micro-Benchmarks (OMB) suite, aimed at evaluating the communication performance of MPI-based parallel applications in Python. To the best of our knowledge, OMB-Py is the first communication benchmark suite for parallel Python applications. OMB-Py consists of a variety of point-to-point and collective communication benchmark tests implemented for a range of popular Python libraries, including NumPy, CuPy, Numba, and PyCUDA. We also provide Python implementations of several distributed ML algorithms as benchmarks to understand the potential performance gains for ML/DL workloads. Our evaluation reveals that mpi4py introduces only a small overhead compared to native MPI libraries. We also evaluate the ML/DL workloads and report up to a 106x speedup on 224 CPU cores compared to sequential execution. We plan to publicly release OMB-Py to benefit the Python HPC community.
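
To illustrate the style of test OMB-Py targets, the following is a minimal sketch of a buffer-based point-to-point latency benchmark written with mpi4py and NumPy. The iteration counts, message sizes, and variable names are illustrative assumptions, not OMB-Py's actual code.

    # Minimal sketch of an mpi4py point-to-point latency test (illustrative only,
    # not taken from OMB-Py). Run with two MPI ranks.
    from mpi4py import MPI
    import numpy as np
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    SKIP, ITERS = 100, 1000  # warm-up and timed iterations (assumed values)

    for size in (1, 1024, 1024 * 1024):  # message sizes in bytes (assumed values)
        sbuf = np.zeros(size, dtype=np.uint8)
        rbuf = np.empty(size, dtype=np.uint8)
        comm.Barrier()
        for i in range(SKIP + ITERS):
            if i == SKIP:
                start = time.perf_counter()
            if rank == 0:
                comm.Send([sbuf, MPI.BYTE], dest=1, tag=0)
                comm.Recv([rbuf, MPI.BYTE], source=1, tag=0)
            elif rank == 1:
                comm.Recv([rbuf, MPI.BYTE], source=0, tag=0)
                comm.Send([sbuf, MPI.BYTE], dest=0, tag=0)
        if rank == 0:
            elapsed = time.perf_counter() - start
            # One-way latency in microseconds: half the round-trip time,
            # averaged over the timed iterations.
            print(f"{size} bytes: {elapsed / (2 * ITERS) * 1e6:.2f} us")

Launched as, for example, mpirun -np 2 python latency.py, this measures ping-pong latency between two ranks using mpi4py's buffer-based Send/Recv interface; the CuPy, Numba, and PyCUDA variants mentioned in the abstract would presumably substitute device buffers for the NumPy arrays.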
