Impact of acoustical voice activity detection on spontaneous filled pause classification

Filled pause detection is essential for spontaneous speech recognition, because undetected filled pauses can degrade the recognition rate. However, filled pauses are commonly confused with elongations, as the two share the same acoustical properties. The few previous attempts to classify filled pauses and elongations employed hidden Markov models. Our proposed method, which uses a neural network as the classifier, achieved a precision of 96%. We also show that voice activity detection (VAD) affects the performance of speech recognition. Three acoustical VAD methods are compared, and the best precision is achieved by the one that combines volume and first-order difference features. Experiments are conducted on spontaneous Malay speech from Malaysian Parliamentary Debate sessions.
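
To make the idea of a VAD built on volume and first-order difference features concrete, the following is a minimal sketch under stated assumptions, not the configuration used in the experiments. The frame length, hop size, threshold ratios, the OR rule for combining the two features, and all function names are illustrative assumptions; the abstract does not specify them.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sequence of samples into overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def volume(frame):
    """Volume as the sum of absolute values after removing the frame mean."""
    mean = sum(frame) / len(frame)
    return sum(abs(x - mean) for x in frame)

def first_order_difference(frame):
    """Sum of absolute differences between adjacent samples."""
    return sum(abs(b - a) for a, b in zip(frame, frame[1:]))

def detect_voice(samples, frame_len=256, hop=128,
                 vol_ratio=0.1, diff_ratio=0.1):
    """Flag a frame as active if either feature exceeds its threshold.

    Thresholds are set relative to the per-signal maximum of each feature
    (an assumption made for this sketch). The signal must contain at
    least one full frame.
    """
    frames = frame_signal(samples, frame_len, hop)
    vols = [volume(f) for f in frames]
    diffs = [first_order_difference(f) for f in frames]
    vol_thr = vol_ratio * max(vols)
    diff_thr = diff_ratio * max(diffs)
    return [v > vol_thr or d > diff_thr for v, d in zip(vols, diffs)]

# Synthetic check: silence, a 200 Hz tone at 16 kHz sampling, then silence.
silence = [0.0] * 1024
tone = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(1024)]
signal = silence + tone + silence
flags = detect_voice(signal)
```

In this synthetic example, frames that lie entirely within the silent segments are flagged inactive and frames inside the tone are flagged active. Real recordings contain background noise, so the thresholds would need tuning on held-out data.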
