Segmentation of Telephone Speech Based on Speech and Non-speech Models

In this paper we investigate the automatic segmentation of recorded telephone conversations into sentence-like chunks for use in speech recognition systems, using models for speech and non-speech. We present two approaches, based on Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs), respectively. The resulting segmentations yield speech recognition performance, measured by word error rate (WER), that is competitive with manual segmentation.
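As a rough illustration of model-based segmentation, the sketch below labels frames as speech or non-speech by comparing log-likelihoods under two models and then cuts the recording at sufficiently long non-speech runs. It is not the system described in this paper: each class is a single one-dimensional Gaussian over a log-energy feature (a stand-in for a full GMM), and the model parameters, the `min_pause` threshold, and all function names are assumptions chosen for the example.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-density of a 1-D Gaussian; a one-component stand-in for a GMM."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify_frames(energies, speech=(-2.0, 1.0), nonspeech=(-6.0, 1.0)):
    """Label a frame True (speech) when the speech model is more likely.

    The (mean, variance) pairs are hypothetical values, not trained models.
    """
    return [
        gaussian_loglik(e, *speech) > gaussian_loglik(e, *nonspeech)
        for e in energies
    ]

def segment(labels, min_pause=3):
    """Return (start, end) frame ranges, end exclusive.

    A segment is closed once min_pause consecutive non-speech frames
    follow its last speech frame; shorter pauses stay inside the segment.
    """
    segments = []
    start = None        # first frame of the segment being built
    last_speech = None  # most recent speech frame seen
    for i, is_speech in enumerate(labels):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_pause:
            segments.append((start, last_speech + 1))
            start = None
    if start is not None:
        segments.append((start, last_speech + 1))
    return segments

# Hypothetical per-frame log-energies: two bursts of speech separated by a long pause.
energies = [-6, -6, -2, -2, -6, -2, -2, -6, -6, -6, -6, -2, -6]
chunks = segment(classify_frames(energies), min_pause=3)
```

In this toy input the one-frame pause inside the first burst is bridged, while the four-frame pause ends the segment. A real system would replace the single Gaussians with trained GMMs or an SVM over richer acoustic features.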
