Spontaneous speech: how people really talk and why engineers should care

Spontaneous conversation is optimized for human-human communication, but it differs in important ways from the types of speech for which human language technology is often developed. This overview describes four fundamental properties of spontaneous speech that present challenges for spoken language applications because they violate assumptions often made in automatic processing technology.
