The Importance of Fillers for Text Representations of Speech Transcripts

Although they are an essential component of spoken language, fillers (e.g. "um" or "uh") are often overlooked in Spoken Language Understanding (SLU) tasks. We explore the possibility of representing them with deep contextualised embeddings, showing improvements in modelling spoken language and on two downstream tasks: predicting a speaker's stance and expressed confidence.
