Factor analysis for audio-based video genre classification

Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Latent Factor Analysis (LFA) proposed decomposing the model into useful and useless components. This method has been successfully applied to speaker and language recognition tasks. In this paper, we study the use of LFA for video genre classification using only the audio channel. We propose a classification method based on short-term cepstral features and Gaussian Mixture Model (GMM) or Support Vector Machine (SVM) classifiers, combined with Factor Analysis (FA). Experiments are conducted on a corpus composed of 5 video genres (music, commercials, cartoons, movies and news). Compared with the baseline Gaussian Mixture Model Universal Background Model (GMM-UBM) system, the best factor analysis configuration reduces the classification error by about 56% (relative), corresponding to a correct identification rate of about 90%.
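The abstract combines a GMM-UBM classifier with factor analysis that isolates a low-rank "useless" (session or channel) component of the model. The snippet below is a minimal sketch of that idea. It assumes a diagonal-covariance UBM, a session-variability matrix U that has already been trained, and a MAP relevance factor of 16. All of these, along with every function name, are hypothetical choices for illustration and are not taken from the paper. The sketch computes Baum-Welch statistics for one recording, then estimates the session factors x with the usual closed-form point estimate x = (I + U^T Sigma^-1 N U)^-1 U^T Sigma^-1 F_centered. It then removes the session shift U x from the statistics and MAP-adapts the UBM means into a compensated supervector. It is a simplified illustration of FA-based compensation, not the authors' exact implementation.

```python
import numpy as np


def baum_welch_stats(frames, weights, means, variances):
    """Zeroth- and first-order statistics of (T, D) frames against a diagonal UBM."""
    diff = frames[:, None, :] - means[None, :, :]                      # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))  # (T, C)
    log_post = np.log(weights) + log_gauss                             # (T, C)
    log_post -= log_post.max(axis=1, keepdims=True)                    # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                            # frame occupancies gamma_t(c)
    return post.sum(axis=0), post.T @ frames                           # n: (C,), f: (C, D)


def estimate_session_factors(n, f, means, variances, U):
    """Point estimate of x under a standard-normal prior; U has shape (C*D, R)."""
    D = means.shape[1]
    f_centered = (f - n[:, None] * means).reshape(-1)                  # centered stats, (C*D,)
    inv_var = 1.0 / variances.reshape(-1)                              # diagonal Sigma^-1
    n_sv = np.repeat(n, D)                                             # N expanded to supervector size
    precision = np.eye(U.shape[1]) + U.T @ ((n_sv * inv_var)[:, None] * U)
    return np.linalg.solve(precision, U.T @ (inv_var * f_centered))


def compensated_map_supervector(n, f, means, variances, U, relevance=16.0):
    """MAP-adapt the UBM means after removing the estimated session shift U x."""
    C, D = means.shape
    x = estimate_session_factors(n, f, means, variances, U)
    offset = (U @ x).reshape(C, D)                                     # session shift per Gaussian
    f_comp = f - n[:, None] * offset                                   # compensated first-order stats
    adapted = (f_comp + relevance * means) / (n[:, None] + relevance)  # relevance-MAP update
    return adapted.reshape(-1)                                         # GMM supervector, (C*D,)


# Toy usage with random data (hypothetical sizes: 4 Gaussians, 3-dim features, rank 2).
rng = np.random.default_rng(0)
C, D, R = 4, 3, 2
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))
U = 0.5 * rng.normal(size=(C * D, R))
frames = rng.normal(size=(200, D))

n, f = baum_welch_stats(frames, weights, means, variances)
sv = compensated_map_supervector(n, f, means, variances, U)
print(sv.shape)  # (12,)
```

Subtracting n_c times the session offset from each Gaussian's first-order statistics keeps the genre-dependent part of the adapted means while discarding the estimated session shift. The resulting supervectors could then be scored against per-genre GMMs or passed to an SVM. In a full system, U would typically be trained on a development corpus containing several recordings per class.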