Enabling low bitrate mobile visual recognition: a performance versus bandwidth evaluation

The rapid development of hardware and software technologies has made content-based multimedia services feasible on mobile devices such as smartphones and tablets, and a strong need for mobile visual search and recognition has emerged. While many real applications of visual recognition require large-scale recognition systems, the technologies that support server-based scalable visual recognition may not be feasible on mobile devices because of resource constraints. Although the client-server framework ensures scalability, real-time response is subject to limited network bandwidth. Therefore, the main challenge for a mobile visual recognition system is the recognition bitrate, that is, the amount of data that must be transmitted to reach a given recognition performance. In this work, we explore and compare various strategies, such as compact features, feature compression, feature signatures by hashing, and image scaling, to enable low bitrate mobile visual recognition. We argue that the thumbnail image is a competitive candidate for low bitrate visual recognition because it carries multiple features at once, and multi-feature fusion becomes important as the size of the semantic space increases. Our evaluations on two subsets of ImageNet, each containing more than 10,000 images, with 19 and 137 categories respectively, verify the efficacy of thumbnail images. We further suggest a new strategy that combines a single (local) feature signature with the thumbnail image, which achieves a significant bitrate reduction from (average) 102,570 to 4,661 bytes with merely (overall) 10% performance degradation.
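To make the notions of a hashed feature signature and recognition bitrate concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes sign-of-random-projection hashing as the signature scheme, a hypothetical 128-dimensional descriptor, and a hypothetical 64-bit signature length; the function names are illustrative. Only the two byte counts passed to `bitrate_reduction` come from the abstract.

```python
import random


def make_projections(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes of dimension dim (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_signature(feature, projections):
    """Pack the sign of each random projection into a compact byte string."""
    bits = 0
    for row in projections:
        dot = sum(w * x for w, x in zip(row, feature))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    n_bytes = (len(projections) + 7) // 8
    return bits.to_bytes(n_bytes, "big")


def bitrate_reduction(before_bytes, after_bytes):
    """Fraction of transmitted bytes saved when going from before to after."""
    return 1.0 - after_bytes / before_bytes


# Hypothetical 128-D descriptor standing in for a real local feature.
rng = random.Random(1)
descriptor = [rng.random() for _ in range(128)]

projections = make_projections(dim=128, n_bits=64)
signature = hash_signature(descriptor, projections)

# A 64-bit signature occupies 8 bytes on the wire.
print(len(signature))

# Average payloads reported in the abstract: 102,570 bytes down to 4,661 bytes.
print(round(bitrate_reduction(102570, 4661), 4))
```

With the figures from the abstract, the reduction works out to roughly 95.5% of the transmitted bytes, or about a 22-fold smaller payload. The client and server would need to share the same projection matrix (here fixed by `seed`) for signatures to be comparable.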
