Comparison of different strategies for an SVM-based audio segmentation

In this paper, we compare several hierarchical and multi-class approaches to the speech/music segmentation task, all based on Support Vector Machines combined with median-filter post-processing. We show the efficiency of kernel tuning through the novel Kernel Target Alignment criterion. Quantitative results give an F-measure of 96.9%, which represents an error reduction of about 50% compared to the results gathered by the French ESTER evaluation campaign. We also show the relevance of SVMs with very low feature vector dimension on this task.
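As a rough illustration of the two ingredients named above, the sketch below computes the Kernel Target Alignment between a kernel matrix and binary labels, uses it to pick a kernel width, and smooths frame-level decisions with a median filter. The abstract does not specify the kernel, the candidate widths, or the filter length, so the Gaussian (RBF) kernel, the `gammas` grid, the `width` parameter, and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian kernel matrix exp(-gamma * ||x_i - x_j||^2) (assumed kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_target_alignment(K, y):
    """Alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F) for labels y in {-1, +1}."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

def pick_gamma(X, y, gammas):
    """Return the candidate width with the highest alignment (hypothetical helper)."""
    return max(gammas, key=lambda g: kernel_target_alignment(rbf_kernel(X, g), y))

def median_smooth(labels, width):
    """Median-filter a 1-D sequence of +/-1 frame labels; width must be odd."""
    half = width // 2
    padded = np.pad(labels, half, mode="edge")  # repeat edge labels at the borders
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=1).astype(labels.dtype)

# Example use on synthetic two-dimensional features (made-up data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
best_gamma = pick_gamma(X, y, gammas=[0.01, 0.1, 1.0, 10.0])
smoothed = median_smooth(np.array([1, 1, -1, 1, 1]), width=3)
```

Selecting the width by alignment only needs the kernel matrix and the labels, so no SVM has to be trained for each candidate. The median filter then removes isolated frame-level decisions that disagree with their neighbours.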
