Combining high productivity and high performance in image processing using Single Assignment C on multi-core CPUs and many-core GPUs

We address the challenge of developing parallel industrial high-performance inspection systems by comparing a conventional, manually parallelized approach with an auto-parallelizing technique. To this end, we introduce the functional array programming language Single Assignment C (SAC), which relies on a hardware virtualization concept to generate parallel machine code automatically for multi-core CPUs and many-core GPUs. Software engineering aspects such as programmability, productivity, understandability, and maintainability, as well as the resulting performance gains, are discussed from a developer's point of view. Using several illustrative benchmark examples from the fields of image processing and machine learning, we analyze the relationship between runtime performance and development efficiency.
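
As a brief illustration of the programming model (a minimal sketch in SAC's with-loop notation as used in the SAC literature; the function name and the particular smoothing filter are our own example, not taken from the paper), an image-processing kernel is expressed as a single data-parallel array operation that the compiler can map to multi-threaded CPU code or GPU kernels without source changes:

  /* Illustrative sketch: average each inner pixel with its four
     direct neighbours; modarray copies border pixels unchanged. */
  double[.,.] smooth(double[.,.] img)
  {
    res = with {
            ([1,1] <= iv < shape(img) - [1,1]) :
              (img[iv] + img[iv + [1,0]] + img[iv - [1,0]]
                       + img[iv + [0,1]] + img[iv - [0,1]]) / 5.0;
          } : modarray(img);
    return res;
  }

The same source compiles unchanged to sequential, multi-threaded, or GPU code; this is the hardware virtualization idea referred to above.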
