Scalable and accurate multi-GPU-based image reconstruction of large-scale ptychography data

Advances in synchrotron light sources, together with the development of focusing optics and detectors, allow nanoscale ptychographic imaging of materials and biological specimens. However, the corresponding experiments can yield terabyte-scale volumes of data that impose a heavy burden on the computing platform. Although graphics processing units (GPUs) provide high performance for such large-scale ptychography datasets, a single GPU is typically insufficient for analysis and reconstruction. Several existing works have leveraged multiple GPUs to accelerate ptychographic reconstruction, but they rely solely on the Message Passing Interface (MPI) to handle communication between GPUs. This approach is inefficient for configurations with multiple GPUs in a single node, especially when processing a single large projection, because it provides no optimizations for heterogeneous GPU interconnects that contain both low-speed links (e.g., PCIe) and high-speed links (e.g., NVLink). In this paper, we present a multi-GPU implementation that effectively solves large-scale ptychographic reconstruction problems with optimized performance on intra-node multi-GPU configurations. We focus on the conventional maximum-likelihood reconstruction problem, solved with the conjugate-gradient (CG) method, and propose a novel hybrid parallelization model to address the performance bottlenecks in the CG solver. Accordingly, we develop a tool called PtyGer (Ptychographic GPU(multiple)-based reconstruction) that implements our hybrid parallelization model. Our comprehensive evaluation verifies that PtyGer fully preserves the original algorithm’s accuracy while achieving outstanding intra-node GPU scalability.
