Prosody in Automatic Speech Processing

We understand Automatic Speech Processing (ASP) to mean word recognition (Automatic Speech Recognition, ASR), processing of higher linguistic components (syntax, semantics, and pragmatics), and processing of Computational Paralinguistics (CP). This chapter attempts to describe the role of prosody in ASP from the word level up to the level of CP, where the focus was initially on emotion recognition and was later extended to the recognition of health conditions, social signals such as backchannelling, and speaker states and traits (Schuller & Batliner 2014). Automatic processing of prosody means that at least part of the processing is done by the computer. The automatic part can be small, e.g., limited to pitch extraction, followed by manual correction of the F0 values and subsequent automatic computation of characteristic values such as mean, minimum, or maximum. This is typically done in basic, possibly exploratory, research on prosody and in studies that aim to evaluate particular models and theories. Fully automatic processing of prosody, on the other hand,
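
The semi-automatic workflow mentioned above (pitch extraction, manual correction of the F0 values, then automatic computation of characteristic values) can be illustrated with a minimal sketch of the last step. The function name `f0_summary`, the example contour, and the convention that unvoiced frames are stored as 0.0 or None are assumptions made for this illustration; they are not taken from the chapter.

```python
from statistics import mean


def f0_summary(f0_hz):
    """Return mean, minimum, and maximum F0 (in Hz) over voiced frames.

    Assumption: unvoiced frames are encoded as 0.0 or None, a common
    but not universal convention of pitch trackers.
    """
    voiced = [f for f in f0_hz if f is not None and f > 0.0]
    if not voiced:
        raise ValueError("no voiced frames in contour")
    return {"mean": mean(voiced), "min": min(voiced), "max": max(voiced)}


# Hypothetical, manually corrected F0 contour (one value per frame, in Hz).
contour = [0.0, 110.0, 120.0, 130.0, 0.0, 100.0]
stats = f0_summary(contour)
print(stats)
```

Excluding unvoiced frames before averaging matters here: if the zeros were included, the mean and minimum would describe the voicing decision rather than the pitch contour.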

[1] Wilhelm Oppenrieder et al. Eine Frage ist eine Frage ist keine Frage. Perzeptionsexperimente zum Fragemodus im Deutschen, 1989.

[2] Elmar Nöth et al. Prosodic models, automatic speech understanding, and speech synthesis: towards the common ground, 2001, INTERSPEECH.

[3] Anton Batliner et al. Emotion Analysis and Emotion-Handling Subdialogues, 2006, SmartKom.

[4] C. W. Wightman. ToBI Or Not ToBI?, 2002.

[5] Francisco Gomes de Matos et al. How different are we? Spoken Discourse in Intercultural Communication, 2003.

[6] Katherine Hilton et al. The Perception of Overlapping Speech: Effects of Speaker Prosody and Listener Attitudes, 2016, INTERSPEECH.

[7] Elizabeth Shriberg et al. Higher-Level Features in Speaker Recognition, 2007, Speaker Classification.

[8] S. Hurtley. How Different Are We?, 2003, Science.

[9] Björn W. Schuller et al. Mothers, adults, children, pets — towards the acoustics of intimacy, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] R. Coe et al. It's the Effect Size, Stupid: What effect size is and why it is important, 2012.

[11] Mari Ostendorf et al. ToBI: a standard for labeling English prosody, 1992, ICSLP.

[12] Mari Ostendorf et al. Parse scoring with prosodic information: an analysis/synthesis approach, 1993, Comput. Speech Lang.

[13] Andrew Rosenberg et al. Automatic detection and classification of prosodic events, 2009.

[14] Wayne A. Lea et al. Prosodic Aids to Speech Recognition, 1972.

[15] Elmar Nöth et al. The Automatic Assessment of Non-native Prosody: Combining Classical Prosodic Analysis with Acoustic Modelling, 2012, INTERSPEECH.

[16] Fabien Ringeval et al. Affective and behavioural computing: Lessons learnt from the First Computational Paralinguistics Challenge, 2019, Comput. Speech Lang.

[17] Frank Dellaert et al. Recognizing emotion in speech, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[18] Jacqueline Vaissière. On automatic extraction of prosodic information for automatic speech recognition system, 1989, EUROSPEECH.

[19] Andreas Stolcke et al. Prosody Modeling for Automatic Speech Understanding: An Overview of Recent Research at SRI, 2008.

[20] David Escudero Mancebo et al. Acoustic characterization and perceptual analysis of the relative importance of prosody in speech of people with Down syndrome, 2018, Speech Commun.

[21] Florian Hönig. Automatic assessment of prosody in second language learning = Automatische Bewertung von Prosodie beim Fremdsprachenlernen, 2017.

[22] Elmar Nöth et al. M = Syntax + Prosody: A syntactic-prosodic labelling scheme for large spontaneous speech databases, 1998, Speech Commun.

[23] Ian H. Witten et al. The WEKA data mining software: an update, 2009, SIGKDD Explor.

[24] Christian Müller et al. Speaker Classification I: Fundamentals, Features, and Methods, 2007, Speaker Classification.

[25] A. Rosenberg. Speech, Prosody, and Machines: Nine Challenges for Prosody Research, 2018, Speech Prosody 2018.

[26] Andrew Rosenberg et al. Classifying Skewed Data: Importance Weighting to Optimize Average Recall, 2012, INTERSPEECH.

[27] Julia Hirschberg et al. Automatic classification of intonational phrase boundaries, 1992.

[28] Alex Waibel et al. Prosody and speech recognition, 1988.

[29] Erik Marchi et al. Emotion in the speech of children with autism spectrum conditions: prosody and everything else, 2012, WOCCI.

[30] Jesús Francisco Vargas-Bonilla et al. Low-frequency components analysis in running speech for the automatic detection of Parkinson's disease, 2015, INTERSPEECH.

[31] Björn W. Schuller et al. The INTERSPEECH 2009 emotion challenge, 2009, INTERSPEECH.

[32] Wolfgang Wahlster et al. SmartKom: Foundations of Multimodal Dialogue Systems, 2006, SmartKom.

[33] Elmar Nöth et al. Whence and Whither Prosody in Automatic Speech Understanding: A Case Study, 2002.

[34] Elmar Nöth et al. How to repair speech repairs in an end-to-end system, 2001, DiSS.

[35] Elmar Nöth et al. The prediction of focus, 1989, EUROSPEECH.

[36] Elmar Nöth et al. Prosodic feature evaluation: brute force or well designed?, 1999.

[37] Loïc Kessous et al. Whodunnit - Searching for the most important feature types signalling emotion-related user states in speech, 2011, Comput. Speech Lang.

[38] R. L. Thorndike. Who belongs in the family?, 1953.

[39] Xuedong Huang et al. Toward Human Parity in Conversational Speech Recognition, 2017.

[40] Wolfgang Wahlster et al. Verbmobil: Foundations of Speech-to-Speech Translation, 2000, Artificial Intelligence.

[41] Briony Williams et al. The phonetic manifestation of word stress, 1999.

[42] Alfons Trompenaars et al. Riding the Waves of Culture: Understanding Diversity in Global Business, 1993.

[43] Rahul Gupta et al. Automated evaluation of non-native English pronunciation quality: combining knowledge- and data-driven features at multiple time scales, 2015, INTERSPEECH.

[44] Shrikanth S. Narayanan et al. Robust Unsupervised Arousal Rating: A Rule-Based Framework with Knowledge-Inspired Vocal Features, 2014, IEEE Transactions on Affective Computing.

[45] Elmar Nöth et al. Automatic modelling of depressed speech: relevant features and relevance of gender, 2014, INTERSPEECH.

[46] Björn Schuller et al. Computational Paralinguistics, 2013.

[47] P. Mermelstein. Automatic segmentation of speech into syllabic units, 1975, The Journal of the Acoustical Society of America.

[48] Elmar Nöth et al. Acoustic-Prosodic Characteristics of Sleepy Speech - Between Performance and Interpretation, 2014.

[49] Björn W. Schuller et al. Recent developments in openSMILE, the Munich open-source multimedia feature extractor, 2013, ACM Multimedia.

[50] Anton Batliner et al. Speaker Characteristics and Emotion Classification, 2007, Speaker Classification.

[51] Geoffrey Zweig et al. Achieving Human Parity in Conversational Speech Recognition, 2016, ArXiv.

[52] Andrew Rosenberg et al. Let me finish: automatic conflict detection using speaker overlap, 2013, INTERSPEECH.

[53] P. Lieberman. Some Acoustic Correlates of Word Stress in American English, 1959.

[54] Jacqueline Vaissière et al. The use of prosodic parameters in automatic speech recognition, 1988.

[55] Elmar Nöth et al. VERBMOBIL: the use of prosody in the linguistic components of a speech understanding system, 2000, IEEE Trans. Speech Audio Process.

[56] B. Rosner et al. Loudness predicts prominence: fundamental frequency lends little, 2005, The Journal of the Acoustical Society of America.