Detecting Escalation Level from Speech with Transfer Learning and Acoustic-Lexical Information Fusion

Textual escalation detection has been widely applied in e-commerce customer service systems to pre-alert agents and prevent potential conflicts. Similarly, in public areas such as airports and train stations, where many interpersonal conversations take place, acoustic escalation detection systems can enhance passenger safety and help maintain public order. To this end, we introduce a system that detects escalation from speech using acoustic-lexical features; Voice Activity Detection (VAD) and label smoothing are adopted to further enhance performance in our experiments. Given the small size of the training and development sets, we also employ transfer learning on several well-known emotion recognition datasets, i.e., RAVDESS and CREMA-D, to learn emotional representations that are then applied to the conversational escalation detection task. On the development set, our proposed system achieves 81.5% unweighted average recall (UAR), significantly outperforming the baseline's 72.2% UAR.
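As a brief illustration (not code from the paper), the two evaluation and training devices named above are simple to state: unweighted average recall is the mean of per-class recalls, so each class counts equally regardless of its sample count, and label smoothing softens one-hot training targets by redistributing a small amount of probability mass. A minimal NumPy sketch, with made-up example labels:

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    # UAR: average the recall of each class, so a majority class
    # cannot dominate the score the way plain accuracy allows.
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

def smooth_labels(one_hot, eps=0.1):
    # Label smoothing: keep (1 - eps) on the true class and spread
    # eps uniformly over all classes; rows still sum to 1.
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

# Hypothetical 3-level escalation labels (0 = low, 1 = mid, 2 = high).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(round(unweighted_average_recall(y_true, y_pred), 3))  # 0.667

print(smooth_labels(np.eye(3)[0], eps=0.1))  # [0.933..., 0.033..., 0.033...]
```

Per-class recalls here are 0.5, 1.0, and 0.5, giving a UAR of 2/3, whereas plain accuracy would be 4/6; the gap is why UAR is preferred when escalation classes are imbalanced.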

[1] Xi Chen et al. Automatic Conflict Detection in Police Body-Worn Audio. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[2] Gertjan J. Burghouts et al. An audio-visual dataset of human–human interactions in stressful situations. Journal on Multimodal User Interfaces, 2014.

[3] Cecilia Mascolo et al. The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates. Interspeech, 2021.

[4] Benjamin Schrauwen et al. Transfer Learning by Supervised Pre-training for Audio-based Music Classification. ISMIR, 2014.

[5] Zhigang Deng et al. Analysis of emotion recognition using facial expressions, speech and multimodal information. ICMI '04, 2004.

[6] Fabio Valente et al. Automatic detection of conflict escalation in spoken conversations. INTERSPEECH, 2012.

[7] Wei Zhao et al. Research on the deep learning of the small sample data based on transfer learning. 2017.

[8] Margaret Lech et al. Towards real-time Speech Emotion Recognition using deep neural networks. 2015 9th International Conference on Signal Processing and Communication Systems (ICSPCS), 2015.

[9] Ragini Verma et al. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Transactions on Affective Computing, 2014.

[10] Yichuan Tang. Deep Learning using Linear Support Vector Machines. arXiv:1306.0239, 2013.

[11] George Trigeorgis et al. End-to-End Multimodal Emotion Recognition Using Deep Neural Networks. IEEE Journal of Selected Topics in Signal Processing, 2017.

[12] Carlos Busso et al. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008.

[13] Andrew Rosenberg et al. Let me finish: automatic conflict detection using speaker overlap. INTERSPEECH, 2013.

[14] Emily Mower Provost et al. Progressive Neural Networks for Transfer Learning in Emotion Recognition. INTERSPEECH, 2017.

[15] Stefan Winkler et al. Deep Learning for Emotion Recognition on Small Datasets using Transfer Learning. ICMI, 2015.

[16] Ray Kurzweil et al. Multilingual Universal Sentence Encoder for Semantic Retrieval. ACL, 2019.

[17] Matthai Philipose et al. Limiting Numerical Precision of Neural Networks to Achieve Real-Time Voice Activity Detection. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[18] Iryna Gurevych et al. Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. EMNLP, 2020.

[19] Jian Sun et al. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[20] Claude Montacié et al. Detecting Speech Interruptions for Automatic Conflict Detection. 2015.

[21] Qiang Chen et al. Network In Network. ICLR, 2014.

[22] S. R. Livingstone et al. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 2018.

[23] Iryna Gurevych et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP, 2019.

[24] Geoffrey E. Hinton et al. When Does Label Smoothing Help? NeurIPS, 2019.

[25] G. J. Burghouts et al. Automatic Audio-Visual Fusion for Aggression Detection Using Meta-information. 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2012.