Efficient Algorithms for the Summed Area Tables Primitive on GPUs

The two-dimensional summed area table (SAT) is a fundamental primitive used in image processing and machine learning applications. We present a collection of optimization methods for computing SATs on CUDA-enabled GPUs. Conventional approaches compute the prefix sum along one dimension in parallel, transpose the matrix, and then compute the prefix sum along the other dimension in parallel. They also use the scratchpad memory as a cache. We propose a collection of algorithms that scale with problem size. We use the register cache technique instead of the scratchpad memory, and we employ a naive serial scan at the thread level to compute the prefix sum along one of the dimensions. Using a novel transpose-in-registers method, we increase inter-thread parallelism and outperform conventional SAT implementations. In addition, we significantly reduce both the communication between threads and the number of arithmetic instructions. Our evaluations on Nvidia Pascal P100 and Volta V100 GPUs demonstrate that our implementations outperform state-of-the-art libraries, yielding up to 2.3x and 3.2x speedup over the OpenCV and Nvidia NPP libraries, respectively.
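The conventional scan-transpose-scan scheme described above can be sketched serially as follows. This is a minimal CPU reference in Python meant only to show the data flow, not the GPU algorithms proposed in the paper; the function names (`prefix_sum_rows`, `transpose`, `summed_area_table`, `box_sum`) are illustrative assumptions and do not come from the paper or from any library.

```python
def prefix_sum_rows(matrix):
    """Inclusive prefix sum along each row, using a plain serial scan per row."""
    result = []
    for row in matrix:
        running = 0
        scanned = []
        for value in row:
            running += value
            scanned.append(running)
        result.append(scanned)
    return result


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def summed_area_table(image):
    """Scan rows, transpose, scan rows again, then transpose back."""
    rows_scanned = prefix_sum_rows(image)
    cols_scanned = prefix_sum_rows(transpose(rows_scanned))
    return transpose(cols_scanned)


def box_sum(sat, top, left, bottom, right):
    """Sum of image[top..bottom][left..right] (inclusive) from four SAT lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
sat = summed_area_table(image)
# sat == [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
# box_sum(sat, 1, 1, 2, 2) == 5 + 6 + 8 + 9 == 28
```

Each entry `sat[y][x]` holds the sum of all image values at or above row `y` and at or left of column `x`, which is why any rectangular sum needs only four lookups. On a GPU, the two row scans are the parts that run in parallel, and the transposes are the steps that the abstract's transpose-in-registers method targets.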
