Feature Correlation Hypergraph: Exploiting High-order Potentials for Multimodal Recognition

In computer vision and multimedia analysis, an object is commonly represented by multiple features (multimodal features). For example, to characterize a natural scene image well, we typically extract a set of visual features that represent its color, texture, and shape. However, integrating multimodal features optimally is challenging because they are usually high-order correlated; e.g., the histogram of oriented gradients (HOG), the bag of scale-invariant feature transform (SIFT) descriptors, and wavelets are closely related because they collaboratively reflect the image texture. Existing algorithms nevertheless fail to capture the high-order correlation among multimodal features. To solve this problem, we present a new multimodal feature integration framework. In particular, we first define a new measure that captures the high-order correlation among the multimodal features and can be regarded as a direct extension of the previous binary (pairwise) correlation. Then, we construct a feature correlation hypergraph (FCH) to model the high-order relations among multimodal features. Next, a clustering algorithm is performed on the FCH to group the original multimodal features into a set of partitions. Finally, a multiclass boosting strategy is developed to obtain a strong classifier by combining the weak classifiers learned from each partition. Experimental results on seven popular datasets show the effectiveness of our approach.
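
To illustrate why a pairwise measure can miss a high-order correlation, the following minimal sketch uses total correlation (the sum of marginal entropies minus the joint entropy), which reduces to mutual information for two variables. This is an assumption chosen for illustration only: the abstract does not specify the form of the proposed measure, and the function names `entropy` and `total_correlation`, as well as the toy binary features, are hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_correlation(*features):
    """Sum of marginal entropies minus the joint entropy.

    Each argument is one discretized feature observed over the same samples.
    With exactly two features this equals their mutual information.
    This is an illustrative stand-in, not the measure defined in the paper.
    """
    joint = list(zip(*features))  # one tuple of feature values per sample
    return sum(entropy(f) for f in features) - entropy(joint)


# Toy discretized features (hypothetical data): c is the XOR of a and b.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]

pairwise = total_correlation(a, b)   # 0 bits: a and b look independent
triple = total_correlation(a, b, c)  # 1 bit: the dependence appears only jointly

print(pairwise, triple)
```

In this toy case every pair of features has zero correlation, yet the three features together share one bit, which is the kind of relation a hyperedge connecting all three features could encode and an ordinary pairwise graph edge could not.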
