Filled Pauses and Lengthenings Detection Based on the Acoustic Features for the Spontaneous Russian Speech

Spontaneous speech processing faces a number of problems, among them speech disfluencies. Although speakers handle most disfluencies easily and they usually cause no difficulty for understanding, their occurrence leads to many recognition errors in Automatic Speech Recognition (ASR) systems. This paper deals with the most frequent disfluencies, filled pauses and sound lengthenings, and bases their detection on an analysis of their acoustic parameters. A method based on the autocorrelation function was used to detect voiced hesitation phenomena, and a band-filtering method was used to detect unvoiced hesitation phenomena. The experiments on detecting filled pauses and lengthenings used a specially collected corpus of spontaneous Russian map-task and appointment-task dialogs. The detection accuracy was 80% for voiced filled pauses and lengthenings and 66% for unvoiced fricative lengthenings.
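The two acoustic cues named above can be illustrated with a minimal per-frame sketch. It is not the authors' implementation: the function names, the 80-400 Hz pitch range, the 4-8 kHz band, the 40 ms frame, and the 16 kHz sampling rate are all assumptions chosen for illustration. The first function measures how periodic a frame is through its normalized autocorrelation, which tends to be high for sustained voiced sounds. The second measures the share of spectral energy in a high-frequency band, which tends to be high for fricative noise.

```python
import numpy as np


def autocorr_peak(frame, sr, fmin=80.0, fmax=400.0):
    """Largest normalized autocorrelation value within the assumed pitch lag range."""
    frame = frame - np.mean(frame)
    energy = np.dot(frame, frame)
    if energy == 0:
        return 0.0
    # Keep non-negative lags only and normalize so that lag 0 equals 1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:] / energy
    lo = int(sr / fmax)                       # shortest pitch period in samples
    hi = min(int(sr / fmin), len(ac) - 1)     # longest pitch period in samples
    return float(np.max(ac[lo:hi + 1]))


def band_energy_ratio(frame, sr, f_lo=4000.0, f_hi=8000.0):
    """Share of spectral energy inside [f_lo, f_hi] (assumed fricative band)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = spectrum.sum()
    if total == 0:
        return 0.0
    in_band = spectrum[(freqs >= f_lo) & (freqs <= f_hi)].sum()
    return float(in_band / total)


# Synthetic frames standing in for real speech (illustrative only).
sr = 16000                                    # assumed sampling rate, Hz
t = np.arange(int(0.04 * sr)) / sr            # one 40 ms frame
voiced_like = np.sin(2 * np.pi * 150.0 * t)   # periodic, vowel-like signal
noise_like = np.random.default_rng(0).standard_normal(len(t))  # fricative-like noise

print(autocorr_peak(voiced_like, sr), band_energy_ratio(voiced_like, sr))
print(autocorr_peak(noise_like, sr), band_energy_ratio(noise_like, sr))
```

In a full detector, a frame sequence whose periodicity (or band energy ratio) stays above some threshold for longer than a typical phone duration would be a candidate filled pause or lengthening. Both the thresholds and the minimum duration would have to be tuned on annotated data; none of them is given in the abstract.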
