Structured Visual Feature Learning for Classification via Supervised Probabilistic Tensor Factorization

Structured visual feature learning exploits the intrinsic structural properties of mutually correlated multimedia collections (e.g., video frames or facial images) to learn a more effective feature representation for multimedia data classification. In this paper, we pose structured visual feature learning as a supervised tensor factorization (STF) problem, which learns multi-view visual features from structured tensorial multimedia data. Mathematically, STF is formulated as a joint optimization framework that combines probabilistic inference with ε-insensitive support vector regression. As a result, the feature representation obtained by STF not only preserves the intrinsic multi-view structural information of the tensorial multimedia data but also captures the discriminative information derived from the max-margin learning process. Using the learned discriminative visual features, we conduct multimedia classification experiments on several challenging image and video datasets, which demonstrate the effectiveness of our method.
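
For readers unfamiliar with the max-margin term, the following is a minimal sketch of the standard ε-insensitive loss used in support vector regression, which penalizes a prediction only when it deviates from its target by more than ε. It is not the STF objective itself: the function names `eps_insensitive_loss` and `total_loss` and the default `eps=0.1` are assumptions chosen for illustration, and the probabilistic factorization term is omitted entirely.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Standard epsilon-insensitive loss: max(0, |y_true - y_pred| - eps).

    Deviations smaller than eps cost nothing; larger ones are penalized
    linearly. The default eps is an illustrative assumption.
    """
    return max(0.0, abs(y_true - y_pred) - eps)


def total_loss(targets, predictions, eps=0.1):
    """Sum the epsilon-insensitive loss over paired targets and predictions."""
    return sum(
        eps_insensitive_loss(t, p, eps) for t, p in zip(targets, predictions)
    )


if __name__ == "__main__":
    # Hypothetical class labels in {+1, -1} and regression outputs.
    labels = [1.0, -1.0, 1.0]
    outputs = [0.95, 0.0, 0.5]
    print(total_loss(labels, outputs, eps=0.1))
```

In a joint formulation of the kind the abstract describes, a loss of this form would be evaluated on predictions computed from the learned latent features, so that the margin-based penalty shapes the factorization; how the two terms are weighted and optimized is specified by the paper, not by this sketch.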
