Efficient Execution of Irregular Wavefront Propagation Pattern on Many Integrated Core Architecture

The efficient execution of image processing algorithms is an active area of research in Bioinformatics. In image processing, one class of algorithms, or computing pattern, that works with irregular data structures is the Irregular Wavefront Propagation Pattern (IWPP). In this class, elements propagate information to their neighbors in the form of a wave. This propagation results in irregular data access and irregular wavefront expansion. Because of this irregularity, current implementations of this class of algorithms require atomic operations, which are very costly and also hinder implementations that use Single Instruction, Multiple Data (SIMD) instructions on Many Integrated Core (MIC) architectures; these instructions are critical to attaining high performance on such processors. The objective of this study is to redesign the IWPP algorithm to enable efficient execution on processors with the MIC architecture using SIMD instructions. In this work, using the Intel (R) Xeon Phi (TM) coprocessor, we have implemented: a vectorized version of IWPP that is up to 5.63x faster than the non-vectorized version; a parallel version using a First In, First Out (FIFO) queue that attained a speedup of up to 55x over the single-core version on the coprocessor; a version using a priority queue whose performance was 1.62x better than the fastest GPU-based implementation available in the literature; and a cooperative version between heterogeneous processors that allows processing images larger than the Intel (R) Xeon Phi (TM) memory and also provides a way to use all the available devices in the computation.
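
To make the pattern concrete, the following is a minimal sequential sketch of wavefront propagation with a FIFO queue, written in Python for readability. It is not the implementation evaluated in this work. The function name `wavefront_propagate`, the 4-connected neighborhood, the reconstruction-style propagation rule (a neighbor takes the propagating value, clamped by a mask), and the choice to seed the queue with every pixel are all assumptions made for illustration; the input `marker` is assumed to be pointwise less than or equal to `mask`.

```python
from collections import deque


def wavefront_propagate(marker, mask):
    """Sequential FIFO sketch of an irregular wavefront propagation.

    Hypothetical propagation rule (grayscale-reconstruction style):
    a pixel p raises a neighbor q to min(out[p], mask[q]) when that
    value is larger than out[q].
    """
    rows, cols = len(marker), len(marker[0])
    out = [row[:] for row in marker]  # work on a copy of the marker
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connectivity (assumed)

    # Simplification: the initial wavefront contains every pixel.
    queue = deque((r, c) for r in range(rows) for c in range(cols))

    while queue:
        r, c = queue.popleft()
        for dr, dc in neighbors:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                candidate = min(out[r][c], mask[nr][nc])
                # In a parallel version this compare-and-update would
                # have to be atomic, since several wavefront elements
                # may target the same neighbor at once.
                if candidate > out[nr][nc]:
                    out[nr][nc] = candidate
                    queue.append((nr, nc))  # neighbor joins the wavefront
    return out


marker = [[0, 0, 0],
          [0, 5, 0],
          [0, 0, 0]]
mask = [[3, 3, 0],
        [3, 5, 0],
        [0, 0, 4]]
result = wavefront_propagate(marker, mask)
```

In this small example the value at the center spreads to the connected region allowed by the mask, while the bottom-right pixel stays at 0 because the wave cannot reach it through its zero-valued neighbors. Which pixels enter the queue, and in what order, depends on the data, which is the source of the irregular access described above.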
