From pixels to semantics: visual concept detection and its applications; Från pixlar till semantik: detektion av visuella koncept samt tillämpningar

Aalto University, P.O. Box 11000, FI-00076 Aalto
www.aalto.fi

Author: Mats Sjöberg
Name of the doctoral dissertation: From pixels to semantics: visual concept detection and its applications
Publisher: School of Science
Unit: Department of Information and Computer Science
Series: Aalto University publication series DOCTORAL DISSERTATIONS 156/2014
Field of research: Computer and Information Science
Manuscript submitted: 11 June 2014
Date of the defence: 25 November 2014
Permission to publish granted (date): 12 August 2014
Language: English
Monograph / Article dissertation (summary + original articles)

Abstract

The amount of digital visual information available in the world today is enormous, and more is generated continuously at a remarkable rate. For example, YouTube receives 100 hours of new video every minute, and Facebook more than 350 million new photos every day. At best, this represents the creativity and knowledge of millions or even billions of people, made available to the entire world thanks to the Internet. The problem is, of course: how do we find the "needle" that is relevant to us in this enormous "haystack"? Web search engines such as Google and Bing are decent solutions for finding textual content, but finding relevant visual content is as yet an unsolved problem. The core issue is the semantic gap between the raw visual data processed by computers and the abstract concepts and ideas humans use to communicate. This thesis studies one approach to this problem, namely using mid-level concepts to bridge the semantic gap. These semantic concepts are, for example, objects, locations, persons or events, which are relatively concrete and thus comparatively easy to associate with the raw visual data. They can then be used to formulate more abstract queries, or to index and further organise an image or video database. An overview of semantic concept detection using machine learning techniques is presented here, together with some applications.

A central issue is keeping the computational speed and efficiency at a practical level for huge amounts of visual data, while still producing accurate and relevant results. To this end, this thesis studies several fast approximate versions of the popular Support Vector Machine (SVM) algorithm, and proposes some improvements to the fast Self-Organising Map (SOM) algorithm to improve its accuracy. Several large-scale real-world experimental applications are presented, including image retrieval using social network tags, video search, indoor location recognition, and semantic visualisation of large image and video databases. The empirical evidence presented in this thesis shows that while the semantic gap problem is still not solved, the semantic concept approach produces concrete improvements in real-world applications. The improvements proposed and evaluated contribute to making the machine learning algorithms faster and thus more practically useful for processing huge amounts of visual data.
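
As a rough illustration of how mid-level concept scores might be combined into a more abstract query, the following minimal Python sketch ranks images by a weighted sum of per-concept detector outputs. The image identifiers, concept names, score values, and the weighted-sum rule are all assumptions made for illustration. They are not taken from the thesis, and in a real system the scores would come from trained detectors such as linear SVMs.

```python
from typing import Dict, List, Tuple

# Hypothetical detector outputs: image id -> {concept name: score in [0, 1]}.
# These values are made up; real scores would come from trained classifiers.
CONCEPT_SCORES: Dict[str, Dict[str, float]] = {
    "img_001": {"beach": 0.9, "person": 0.2, "indoor": 0.1},
    "img_002": {"beach": 0.1, "person": 0.8, "indoor": 0.9},
    "img_003": {"beach": 0.7, "person": 0.9, "indoor": 0.0},
}


def rank_by_concepts(
    scores: Dict[str, Dict[str, float]],
    query: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank images by a weighted sum of concept scores.

    `query` maps concept names to weights; a negative weight penalises
    a concept. Concepts missing for an image count as score 0.
    """
    ranked = []
    for image_id, concept_scores in scores.items():
        total = sum(w * concept_scores.get(c, 0.0) for c, w in query.items())
        ranked.append((image_id, total))
    # Highest combined score first; ties broken by image id for determinism.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


if __name__ == "__main__":
    # "People on a beach, not indoors", expressed with mid-level concepts.
    query = {"beach": 1.0, "person": 1.0, "indoor": -1.0}
    for image_id, score in rank_by_concepts(CONCEPT_SCORES, query):
        print(f"{image_id}: {score:.2f}")
```

The weighted sum is only one simple way to fuse concept scores. It is used here because it keeps the link between the abstract query and the concrete concepts easy to read.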
