Large-scale interactive exploratory visual search

Large-scale visual search is one of the challenging problems of the big data era. It demands techniques that are not only highly effective and efficient but also allow users to conveniently express their information needs and refine their intents. In this thesis, we focus on developing an exploratory framework for large-scale visual search, since users often have difficulty clearly formulating intent-specific queries for visual content, unlike in keyword-based textual information retrieval. Towards this goal, we also develop a number of enabling techniques, including compact visual content representation for scalable search, near-duplicate video shot detection, and action-based event detection at a fine level.

Although the bag-of-visual-words model has been successfully used for scalable visual information retrieval, the large number of visual words obtained from a query image results in a large amount of data to be uploaded to search engines, which significantly affects the response time of a query. This problem is more serious for mobile users because of limited bandwidth, even as the mobile-and-cloud architecture grows in popularity. Therefore, we propose a novel scheme for extremely low bit rate visual search, which sends compressed visual words consisting of a vocabulary tree histogram and descriptor orientations rather than the descriptors themselves. This scheme reduces the bit rate of a visual query with little extra computational cost on the mobile client. Specifically, we store a vocabulary tree and extract visual descriptors on the mobile client. A lightweight pre-retrieval is then performed to obtain the visited leaf nodes in the vocabulary tree.

Compact representation of video data is achieved by identifying the keyframes of a video, which also helps users comprehend visual content efficiently. We propose a novel Bag-of-Importance (BoI) model for static video summarization that identifies the frames with important local features as keyframes. In addition, we propose to learn a transformation from a raw local feature to a more powerful sparse non-linear representation for deriving the importance score of each local feature, rather than directly using hand-crafted visual features as most existing approaches do.
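The client-side step of the low bit rate query scheme, in which descriptors are pushed down a stored vocabulary tree and only the visited leaf nodes are kept, can be illustrated with a minimal sketch. It is not the thesis implementation: the `Node` class, the `quantize` and `build_query` helpers, the toy two-level tree, and the 2-D descriptors are all hypothetical, and the descriptor orientations and the actual compression of the histogram are left out.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of a (hypothetical) vocabulary tree."""
    center: List[float]
    children: List["Node"] = field(default_factory=list)
    leaf_id: Optional[int] = None  # set only on leaf nodes


def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(root: Node, descriptor: List[float]) -> int:
    """Descend the tree greedily, picking the nearest child at each level."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: sq_dist(c.center, descriptor))
    return node.leaf_id


def build_query(root: Node, descriptors: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the histogram of visited leaves as sorted (leaf_id, count) pairs.

    Only these pairs would be uploaded, instead of the raw descriptors.
    """
    hist = Counter(quantize(root, d) for d in descriptors)
    return sorted(hist.items())


# Toy tree (assumption): branching factor 2, depth 2, 2-D descriptors.
root = Node(center=[0.0, 0.0], children=[
    Node([0.0, 0.0], [Node([0.0, 0.0], leaf_id=0), Node([0.0, 1.0], leaf_id=1)]),
    Node([10.0, 10.0], [Node([10.0, 9.0], leaf_id=2), Node([10.0, 11.0], leaf_id=3)]),
])

descriptors = [[0.1, 0.2], [0.0, 0.9], [9.8, 11.2], [0.2, 1.1]]
query = build_query(root, descriptors)
print(query)  # [(0, 1), (1, 2), (3, 1)]
```

In this toy case four descriptors collapse to three (leaf, count) pairs. With real high-dimensional descriptors the saving would be much larger, since each descriptor is replaced by at most one small integer count.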
