Classification of Formal and Informal Dialogues Based on Turn-Taking and Intonation Using Deep Neural Networks

We introduce a method for classifying dialogues as formal or informal using feature sets derived from prosodic data. The first feature set consists of raw fundamental frequency (F0) values paired with speaker information (i.e. turn-taking). The second consists of prosodic labels extracted from the raw F0 values by the ProsoTool algorithm, likewise complemented by turn-taking information. We evaluated the two feature sets by comparing the accuracy scores our classifier obtained with each of them on dialogue excerpts taken from the HuComTech corpus. With the ProsoTool features we achieved an average accuracy of \(85.2\%\), a relative error rate reduction of \(24\%\) compared with the F0 features. Regardless of the feature set, however, our method yields better accuracy than human listeners, who distinguished formal from informal dialogue with an accuracy of only \(56.5\%\).
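
To make the relative error rate reduction figure concrete, the following minimal sketch shows how such a value is conventionally computed from two accuracy scores, and what F0-feature accuracy the reported numbers would imply. The function names are hypothetical, and the implied baseline is a back-calculation rather than a figure reported here; if the reduction was averaged over several runs, the actual F0 accuracy may differ.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative error rate reduction; accuracies are fractions in [0, 1]."""
    baseline_err = 1.0 - baseline_acc
    improved_err = 1.0 - improved_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline error must be positive")
    return (baseline_err - improved_err) / baseline_err


def implied_baseline_accuracy(improved_acc, reduction):
    """Invert the formula above: baseline accuracy implied by a given reduction."""
    improved_err = 1.0 - improved_acc
    return 1.0 - improved_err / (1.0 - reduction)


# Values taken from the abstract: ProsoTool accuracy and reported reduction.
prosotool_acc = 0.852
reported_reduction = 0.24

# Assumption: the reduction was computed from the two average accuracies.
implied_f0_acc = implied_baseline_accuracy(prosotool_acc, reported_reduction)
print(f"Implied F0-feature accuracy: {implied_f0_acc:.3f}")
```

Under that assumption, an error rate of \(14.8\%\) with the ProsoTool features corresponds to an error rate of roughly \(19.5\%\) with the F0 features.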
