Fusion in Computer Vision

We propose a novel multimodal approach that automatically predicts the visual concepts of images by fusing visual and textual features. It relies on a Selective Weighted Late Fusion (SWLF) scheme which, by optimizing the overall Mean interpolated Average Precision (MiAP), learns to select and weight the best features for each visual concept to be recognized. Experiments were conducted on the MIR Flickr image collection within the ImageCLEF Photo Annotation challenge. The results demonstrate the effectiveness of SWLF: it achieved a MiAP of 43.69% in 2011, ranking second out of 79 submitted runs, and a MiAP of 43.67% in 2012, ranking first out of 80 submitted runs.
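To make the idea of per-concept selection and weighting concrete, the following is a minimal sketch of one way such a late fusion could be set up. It is not taken from the text above and rests on several assumptions: each feature is assumed to have its own classifier that outputs one score per validation image, features are ranked by their validation average precision, the top N are fused with weights proportional to that precision, and N is chosen to maximize the precision of the fused scores. Plain average precision stands in for the interpolated variant behind MiAP, and all function and variable names are hypothetical.

```python
from typing import Dict, List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Plain (non-interpolated) average precision for one concept.

    Used here as a simple stand-in for the interpolated AP behind MiAP.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def fuse(feature_scores: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Late fusion: weighted sum of the per-feature scores for each image."""
    n_images = len(next(iter(feature_scores.values())))
    return [
        sum(w * feature_scores[name][i] for name, w in weights.items())
        for i in range(n_images)
    ]


def select_and_weight(
    val_scores: Dict[str, List[float]], val_labels: List[int]
) -> Dict[str, float]:
    """Choose features and weights for ONE concept on validation data.

    Assumed rule (not stated in the source): rank features by validation AP,
    weight the top N proportionally to their AP, keep the N whose fused AP is best.
    """
    ap = {name: average_precision(s, val_labels) for name, s in val_scores.items()}
    ranked = sorted(ap, key=ap.get, reverse=True)
    best_weights: Dict[str, float] = {}
    best_ap = -1.0
    for n in range(1, len(ranked) + 1):
        top = ranked[:n]
        total = sum(ap[name] for name in top)
        # Fall back to uniform weights if every selected feature has AP 0.
        weights = {name: (ap[name] / total if total else 1.0 / n) for name in top}
        fused_ap = average_precision(fuse(val_scores, weights), val_labels)
        if fused_ap > best_ap:  # strict: prefer fewer features on ties
            best_weights, best_ap = weights, fused_ap
    return best_weights


# Toy validation data for a single concept (made-up numbers).
val_labels = [1, 0, 1, 0]
val_scores = {
    "visual": [0.9, 0.8, 0.7, 0.1],
    "textual": [0.6, 0.2, 0.9, 0.5],
    "noise": [0.1, 0.9, 0.2, 0.8],
}
weights = select_and_weight(val_scores, val_labels)
print(weights)
```

In this sketch the procedure would be repeated independently for every concept, so each concept can end up with a different subset of features and different weights. The learned weights would then be applied with `fuse` to the scores of unseen images.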
