Improving voice activity detection in movies

Voice Activity Detection in movies is a non-trivial and challenging task. The different emotional states of the speakers, as well as the variety of soundscapes and noises contribute to the complexity of the task. In this paper, we propose a set of lightweight features that are specifically designed to perform under such conditions, while at the same time preventing confusions of singing voice with speech. For evaluation, we use four fulllength movies, previously unseen to the system and painstakingly annotated. We compare our detector to a state-of-the-art reference system. The new approach performs better, yielding just about half the Equal Error Rate (EER). Furthermore, since the ground truth annotation task is extremely tedious, and to help with advancing in this topic, we release the annotations of all four movies to the research community. Index Terms: Voice Activity Detection, Speech Detection

[1]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[2]  A. Gray,et al.  A spectral-flatness measure for studying the autocorrelation method of linear prediction of speech analysis , 1974 .

[3]  Carla Teixeira Lopes,et al.  TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[4]  Lie Lu,et al.  Music type classification by spectral contrast feature , 2002, Proceedings. IEEE International Conference on Multimedia and Expo.

[5]  Jonathan G. Fiscus,et al.  Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM {TIMIT} | NIST , 1993 .

[6]  Kornel Laskowski,et al.  Modeling instantaneous intonation for speaker identification using the fundamental frequency variation spectrum , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Rong Tong,et al.  Chinese Dialect Identification Using Tone Features Based on Pitch Flux , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[8]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[9]  Mohammad Hossein Moattar,et al.  A simple but efficient real-time Voice Activity Detection algorithm , 2009, 2009 17th European Signal Processing Conference.

[10]  P. Fränti,et al.  645 Improving Speaker Verification by Periodicity Based Voice Activity Detection , .

[11]  Gerhard Widmer,et al.  A SIMPLE AND EFFECTIVE SPECTRAL FEATURE FOR SPEECH DETECTION IN MIXED AUDIO SIGNALS , 2012 .

[12]  Petros Maragos,et al.  Speech event detection using multiband modulation energy , 2005, INTERSPEECH.

[13]  Joan Serrà,et al.  Shape-based spectral contrast descriptor , 2009 .

[14]  Zdravko Kacic,et al.  A multiconditional robust front-end feature extraction with a noise reduction procedure based on improved spectral subtraction algorithm , 2001, INTERSPEECH.

[15]  Javier Ramírez,et al.  Statistical voice activity detection using a multiple observation likelihood ratio test , 2005, IEEE Signal Processing Letters.

[16]  Gerhard Widmer,et al.  On the reduction of false positives in singing voice detection , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Björn W. Schuller,et al.  Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Milan Sigmund,et al.  Impact of vocal effort variability on automatic speech recognition , 2012, Speech Commun..

[19]  I. Cohen,et al.  AR-GARCH in Presence of Noise: Parameter Estimation and Its Application to Voice Activity Detection , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Björn Schuller,et al.  Opensmile: the munich versatile and fast open-source audio feature extractor , 2010, ACM Multimedia.

[21]  Wonyong Sung,et al.  A statistical model-based voice activity detection , 1999, IEEE Signal Processing Letters.