Automatic classification of personal video recordings based on audiovisual features

A system for automatic classification of personal videos is presented. Personal video recordings are classified into 24 categories frame by frame. Features derived from both audio and image data are used for classification. The system learns the most appropriate parameters and classifiers for the task.

The aim of the present work is to design a system for automatic classification of personal video recordings based on simple audiovisual features that can be easily implemented on different devices. Specifically, the main objective is to classify personal video recordings frame by frame into 24 semantically meaningful categories. These categories include information about the environment, such as indoor or outdoor, the presence or absence of people, and their activity, ranging from sports to partying. To achieve robust classification, features derived from both audio and image data are used and combined with state-of-the-art classifiers such as Gaussian Mixture Models or Support Vector Machines. In the process, several combination schemes of features and classifiers are defined and evaluated on a real data set of personal video recordings. The system learns which parameters and classifiers are most appropriate for this task. The experiments show that the approach using specific classifiers for audio features (Mel-Frequency Cepstral Coefficients, MFCCs) and image features (color and edge histograms), combined through a meta-classification scheme, attains significant performance. The best of the evaluated approaches achieved a promising F-measure above 57% on average over all categories and above 73% for several categories.
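The reported results are F-measures computed per category and then averaged. As a minimal sketch of how such a figure can be obtained, the Python snippet below computes a per-category F-measure (the harmonic mean of precision and recall) from frame-level labels and then takes the unweighted mean over categories. It assumes that each category is scored as an independent yes/no decision per frame and that the average is unweighted; the abstract does not state either detail. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def f_measure(truth: List[Set[str]], pred: List[Set[str]], category: str) -> float:
    """F-measure of one category over a sequence of frames.

    Each frame is represented by the set of category names assigned to it
    (an assumption of this sketch, not a detail given in the abstract).
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(truth, pred):
        if category in pred_labels and category in true_labels:
            tp += 1  # category correctly detected in this frame
        elif category in pred_labels:
            fp += 1  # category predicted but not present
        elif category in true_labels:
            fn += 1  # category present but missed
    if tp == 0:
        return 0.0  # convention used here when precision or recall is undefined or zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f_measure(
    truth: List[Set[str]], pred: List[Set[str]], categories: List[str]
) -> Tuple[Dict[str, float], float]:
    """Per-category F-measures and their unweighted mean."""
    per_category = {c: f_measure(truth, pred, c) for c in categories}
    mean = sum(per_category.values()) / len(categories)
    return per_category, mean


# Hypothetical toy data: four frames, four of the category names mentioned above.
truth = [{"indoor", "people"}, {"indoor"}, {"outdoor", "sports"}, {"outdoor"}]
pred = [{"indoor", "people"}, {"indoor", "people"}, {"outdoor"}, {"outdoor", "sports"}]

per_category, mean = average_f_measure(
    truth, pred, ["indoor", "outdoor", "people", "sports"]
)
for name, score in per_category.items():
    print(f"{name}: {score:.2f}")
print(f"average: {mean:.2f}")
```

In this toy run, a category that is never correctly detected contributes zero to the mean. This illustrates how an average over all categories can sit well below the scores of the best individual categories.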
