Parallelization and characterization of SIFT on multi-core systems

This paper parallelizes and characterizes an important computer vision application -Scale Invariant Feature Transform (SIFT) both on a Symmetric Multiprocessor (SMP) platform and a large scale Chip Multiprocessor (CMP) simulator. SIFT is an approach for extracting distinctive invariant features from images and has been widely applied. In many computer vision problems, a real-time or even super-real-time processing capability of SIFT is required. To meet the computation demand, we optimize and parallelize SIFT to accelerate its execution on multi-core systems. Our study shows that SIFT can achieve a 9.7x ~ llx speedup on a 16 -core SMP system. Furthermore, Single Instruction Multiple Data (SIMD) and cache-conscious optimization bring another 85% performance gain at most. But it is still three times slower than the real-time requirement for High-Definition Television (HDTV) image. Then we study the performance of SIFT on a 64 -core CMP simulator. The results show that for HDTV image, SIFT can achieve an excellent speedup of 52 x and run in real-time finally. Besides the parallelization and optimization work, we also conduct a detailed performance analysis for SIFT on those two platforms. We find that load imbalance significantly limits the scalability and SIFT suffers from intensive burst memory bandwidth requirement on the 16 -core SMP system. However, on the 64 -core CMP simulator the memory pressure is not high due to the shared last-level cache (LLC) which accommodates tremendous read-write sharing in SIFT. Thus it does not affect the scaling performance. In short, understanding the characterization of SIFT can help identify the program bottlenecks and give us further insights into designing better systems.

[1]  Eftychios Sifakis,et al.  Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors , 2007, ISCA '07.

[2]  Jan-Michael Frahm,et al.  Feature tracking and matching in video using programmable graphics hardware , 2007, Machine Vision and Applications.

[3]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[4]  Qi Zhang,et al.  SIFT implementation and optimization for multi-core systems , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[5]  Mahmut T. Kandemir,et al.  Organizing the last line of defense before hitting the memory wall for CMPs , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[6]  William J. Dally,et al.  Scatter-add in data parallel architectures , 2005, 11th International Symposium on High-Performance Computer Architecture.

[7]  David G. Lowe,et al.  Scene modelling, recognition and tracking with invariant image features , 2004, Third IEEE and ACM International Symposium on Mixed and Augmented Reality.

[8]  Donald F. Towsley,et al.  The effectiveness of affinity-based scheduling in multiprocessor networking , 1996, Proceedings of IEEE INFOCOM '96. Conference on Computer Communications.

[9]  Christopher J. Hughes,et al.  Carbon: architectural support for fine-grained parallelism on chip multiprocessors , 2007, ISCA '07.

[10]  Calvin Lin,et al.  Memory Prefetching Using Adaptive Stream Detection , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[11]  Thomas Wiegand,et al.  SIFT Implementation and Optimization for General-Purpose GPU , 2007 .

[12]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[13]  Aamer Jaleel,et al.  Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[14]  Stamatis Vassiliadis,et al.  SIMD Vectorization of Histogram Functions , 2007, 2007 IEEE International Conf. on Application-specific Systems, Architectures and Processors (ASAP).