On the use of acoustic features for automatic disambiguation of homophones in spontaneous German

Abstract: Homophones pose serious problems for automatic speech recognition (ASR) because they share a pronunciation but differ in meaning or spelling. Homophone disambiguation is usually handled within a stochastic language model or by analyzing the context of the homophonous word, much as in word sense disambiguation. Whereas this approach achieves good results on read speech, it fails on conversational, spontaneous speech, where utterances are often short, contain disfluencies, and/or are syntactically incomplete. Phonetic studies, however, have shown that words that are homophonous in read speech often differ in their phonetic detail in spontaneous speech. Although humans use phonetic detail to disambiguate homophones, this linguistic information is usually not explicitly incorporated into ASR systems. In this paper, we show that phonetic detail can be used to disambiguate homophones automatically, using German pronouns as an example. Using 3179 homophonous tokens from a corpus of spontaneous German and a set of acoustic features, we trained a random forest model. Our results show that homophones can be disambiguated reasonably well using acoustic features (74% F1, 92% accuracy). In particular, this model outperforms a model based on lexical context (48% F1, 89% accuracy). This paper is relevant to both speech technologists and linguists. A module that uses phonetic detail, similar to the presented model, is suitable for integration into ASR systems to improve recognition. Likewise, an approach similar to the one taken here, which combines the automatic extraction of acoustic features with statistical analysis, is suitable for integration into phonetic analyses that aim to learn more about the contribution and interplay of acoustic features for functional categories.
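The reported figures pair a high accuracy with a much lower F1 for the lexical-context model (89% accuracy, 48% F1). The following minimal sketch illustrates how such a gap can arise when one class is rare. The confusion counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the abstract does not state how the F1 score was averaged, so the function below simply computes the F1 of a single "positive" class.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1 of the positive class) from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts (NOT from the paper): 1000 tokens, of which only
# 100 belong to the rare class. Most majority-class tokens are correct,
# so accuracy stays high even though the rare class is poorly recovered.
acc, f1 = accuracy_and_f1(tp=40, fp=30, fn=60, tn=870)
print(f"accuracy = {acc:.2f}, F1 = {f1:.2f}")
```

Under these assumed counts, accuracy is dominated by the frequent class, whereas F1 penalizes the missed and spurious rare-class tokens. This is one plausible reading of why the two metrics diverge in the abstract, not a reconstruction of the authors' evaluation.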
