Comparison of Heterogeneous Feature Sets for Intonation Verification

Intonation assessment, of which intonation verification is a subtask, has many applications, ranging from speech training for health-impaired people (from individuals with Parkinson's disease to children with Autism Spectrum Disorder) to second language learning. Most approaches found in the literature rely on intensive preprocessing of the audio signal and hand-crafted feature extraction, and few address the particularities of the Portuguese language. In this paper, we present our work on intonation assessment, developed on a database of binarily labelled Portuguese intonation imitations. We use the set of Low-Level Descriptors (LLDs) and the eGeMAPS feature set, both extracted with the openSMILE toolkit, as well as Problem-Agnostic Speech Encoder (PASE) features, and select from them the feature subsets most informative for prosody. Distances between stimulus and imitation (the so-called intonation similarity scores) are computed with Dynamic Time Warping (DTW) for the different feature subsets and are then used as input features of a binary classifier. Performance reaches 66.9% accuracy on the test set when only one feature set is considered, and increases to 77.5% for a set of seven features.
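The scoring step described above (a DTW distance between a stimulus contour and an imitation contour, later fed to a binary classifier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the contours and the threshold below are hypothetical, and the openSMILE/PASE feature extraction is not shown.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW cost between two 1-D feature contours."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Toy F0-like contours (hypothetical values, one frame per element).
stimulus  = [100, 120, 150, 130, 110]
good_copy = [102, 118, 148, 132, 108]   # close imitation
bad_copy  = [100, 100, 100, 100, 100]   # flat, poor imitation

score_good = dtw_distance(stimulus, good_copy)
score_bad  = dtw_distance(stimulus, bad_copy)

# In the paper, such per-feature DTW scores are inputs to a trained binary
# classifier; a single threshold is the simplest stand-in for illustration.
THRESHOLD = 30.0  # hypothetical
print(score_good < THRESHOLD, score_bad < THRESHOLD)  # → True False
```

In practice one such score would be computed per selected feature (or feature subset), and the resulting score vector, rather than a single threshold, would be passed to the classifier.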
