Automatic Detection of Disfluency Boundaries in Spontaneous Speech of Children Using Audio–Visual Information

The presence of disfluencies in spontaneous speech, while posing a challenge for robust automatic recognition, also offers a means of gaining additional insight into a speaker's communicative and cognitive state. This paper analyzes disfluencies in children's spontaneous speech, in the context of spoken-dialog-based computer game play, and addresses the automatic detection of disfluency boundaries. Although several approaches have been proposed to detect disfluencies in speech, relatively little work has been done to utilize visual information to improve the performance and robustness of disfluency detection systems. This paper describes the use of visual information along with prosodic and language information to detect the presence of disfluencies in a child's computer-directed speech, and shows how these information sources can be integrated to increase the overall information available for disfluency detection. Experimental results on our children's multimodal dialog corpus indicate that a disfluency detection accuracy of over 80% can be obtained by utilizing audio-visual information. Specifically, the addition of visual information to prosody and language features yields relative improvements in disfluency detection error rate of 3.6% and 6.3% for information fusion at the feature level and at the decision level, respectively.
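The distinction between feature-level and decision-level fusion can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the classifiers or the combination rule, so the weighted-average rule below is an assumption, and the function names (`fuse_features`, `fuse_decisions`, `relative_error_reduction`) are hypothetical. The last helper only shows how a relative improvement in error rate is computed.

```python
from typing import List, Optional, Sequence


def fuse_features(*streams: Sequence[float]) -> List[float]:
    """Feature-level fusion: concatenate the per-stream feature vectors
    (e.g. prosodic, language, visual) into one vector for a single classifier."""
    fused: List[float] = []
    for stream in streams:
        fused.extend(stream)
    return fused


def fuse_decisions(posteriors: Sequence[float],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Decision-level fusion: combine the P(disfluency) estimates produced by
    separate per-stream classifiers. A weighted average is ASSUMED here;
    the source does not state which combination rule was used."""
    if weights is None:
        weights = [1.0] * len(posteriors)
    if len(weights) != len(posteriors):
        raise ValueError("one weight per posterior is required")
    return sum(w * p for w, p in zip(weights, posteriors)) / sum(weights)


def relative_error_reduction(baseline_error: float, fused_error: float) -> float:
    """Relative improvement in error rate: (baseline - fused) / baseline."""
    return (baseline_error - fused_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical feature values for one candidate boundary (not from the paper).
    prosody, language, visual = [0.42, 1.30], [0.10], [0.70, 0.20]
    print(fuse_features(prosody, language, visual))

    # Hypothetical per-stream posteriors, thresholded at 0.5 after fusion.
    p = fuse_decisions([0.6, 0.8, 0.4])
    print("disfluency boundary" if p >= 0.5 else "fluent boundary")
```

In the feature-level variant a single classifier sees all modalities at once, whereas in the decision-level variant each modality is classified separately and only the outputs are merged, which makes it easier to weight or drop an unreliable stream.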
