Speech/Nonspeech Segmentation in Web Videos

Transcribing speech in web videos requires first detecting the segments that contain transcribable speech, a step we refer to as segmentation. Commonly used segmentation techniques are inadequate for domains such as YouTube, where background and recording conditions vary widely from video to video. In this work, we investigate alternative audio features and a discriminative classifier, which together yield a lower frame error rate on YouTube videos (25.3%) than the commonly used Gaussian mixture models trained on cepstral features (30.6%). The alternative audio features perform particularly well in noisy conditions.
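
To make the reported metric concrete, the following is a minimal sketch of how a frame error rate could be computed, assuming it is defined as the fraction of frames whose predicted speech/nonspeech label differs from the reference label. That definition, the function names, and the toy label sequences are illustrative assumptions and are not taken from the paper. The second helper only restates the two reported figures as a relative reduction.

```python
from typing import Sequence


def frame_error_rate(reference: Sequence[int], hypothesis: Sequence[int]) -> float:
    """Fraction of frames whose predicted label differs from the reference.

    Assumed definition (not confirmed by the source): 1 = speech, 0 = nonspeech,
    one label per fixed-length audio frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    if len(reference) == 0:
        raise ValueError("label sequences must be non-empty")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical toy labels, for illustration only.
ref = [0, 0, 1, 1, 1, 1, 0, 0]
hyp = [0, 1, 1, 1, 0, 1, 0, 0]
print(f"frame error rate: {frame_error_rate(ref, hyp):.3f}")

# Reported figures: 30.6% (GMM on cepstral features) vs. 25.3% (proposed).
print(f"relative reduction: {relative_reduction(30.6, 25.3):.1%}")
```

Under this assumed definition, the drop from 30.6% to 25.3% is 5.3 percentage points in absolute terms, or roughly a 17% relative reduction in frame errors.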
