Do the Urgent Things first! - Detecting Urgency in Spoken Utterances based on Acoustic Features

In the future, spoken dialogue systems will have to handle increasingly complex user utterances and should react in an intuitive, comprehensible way by adapting to the user, the situation, and the context. In rapidly changing situations, such as interacting with a highly automated car, it is essential to react adequately to quick, urgent interjections, whether they occur within a single utterance or as interruptions of ongoing actions or dialogues. A first step is detecting urgency in user utterances. We therefore designed a gamified user study that simulates such short-term urgent situations and used it to collect data for an initial analysis of acoustic features that are promising for detecting urgent utterances. In the game "What is it?", participants had to find, via speech, a symbol defined by three characteristics from a given set; their search was regularly interrupted by a time-limited urgent task. The collected data show that features derived solely from the audio signal can be used to distinguish between urgent and non-urgent utterances. Further analysis reveals that individual acoustic features represent some phases of the data set better than others. Among other phases, we distinguish Transition and Decline, which capture the shift from non-urgent to urgent speech and back again; these shifts are recognizable and can occur in rapid succession. Finally, we identified several classification methods that successfully detect urgent speech in each phase.
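To make the feature-based pipeline concrete, the snippet below is a minimal, hypothetical sketch of the general approach rather than the authors' implementation: it extracts a few acoustic features commonly used in speech analysis (pitch, short-time energy, and MFCC statistics) from an utterance with librosa and feeds them to a random-forest classifier from scikit-learn. The specific feature set, window settings, file names, and classifier choice are all assumptions for illustration.

```python
# Minimal sketch of acoustic urgency detection (illustrative, not the authors' code).
# Assumes Python with librosa and scikit-learn installed.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def utterance_features(path):
    """Extract a small utterance-level acoustic feature vector."""
    y, sr = librosa.load(path, sr=16000)  # mono, resampled to 16 kHz

    # Fundamental frequency (pitch) via the YIN estimator.
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)
    f0 = f0[np.isfinite(f0)]

    # Short-time energy (RMS) as a loudness proxy.
    rms = librosa.feature.rms(y=y)[0]

    # Spectral envelope summary: mean of 13 MFCCs over time.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    return np.concatenate([
        [f0.mean(), f0.std()],    # pitch level and variability
        [rms.mean(), rms.std()],  # energy level and variability
        mfcc.mean(axis=1),        # average spectral shape
    ])

# Hypothetical labelled data: paths to utterance recordings and
# binary urgency labels (1 = urgent, 0 = non-urgent).
paths = ["utt_001.wav", "utt_002.wav"]  # placeholder file names
labels = [1, 0]

X = np.vstack([utterance_features(p) for p in paths])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, labels)

print(clf.predict(utterance_features("utt_new.wav").reshape(1, -1)))
```

In practice, urgent utterances would likely be much rarer than non-urgent ones, so a real pipeline might additionally address class imbalance, for instance with an oversampling technique such as SMOTE, before training the classifier.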
