Phone Duration Modeling for Speaker Age Estimation in Children

Automatic inference of paralinguistic information such as age from speech is an active area of research with numerous applications in spoken language technology. Speaker age estimation enables personalization and age-appropriate curation of information and content. However, speaker age estimation in children is especially challenging because of the paucity of speech data representing the developmental spectrum and because of high signal variability, especially intra-age variability, which complicates modeling. Most approaches to children's speaker age estimation adopt methods directly from research on adult speech processing. In this paper, we propose features specific to children and focus on the speaker's phone duration as an important biomarker of a child's age. We propose phone duration modeling for predicting age from a child's speech. To enable this, children's speech is first force-aligned with the corresponding transcription to derive phone duration distributions. Statistical functionals are then computed from the duration distribution of each phoneme and used to train regression models that predict speaker age. Two children's speech datasets are employed to demonstrate the robustness of phone duration features. We perform age regression experiments on age categories ranging from kindergarten to grade 10. Experimental results suggest that phone durations contain important development-related information about children. The phonemes that contribute most to the estimation of children's age are analyzed and presented.
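
The feature extraction step described above (phone durations from a forced alignment, summarized per phoneme by statistical functionals) can be illustrated with a minimal sketch. The abstract does not specify the alignment format, the set of functionals, or how unseen phonemes are handled, so the `(phone, start, end)` tuples, the five functionals used here (mean, median, standard deviation, minimum, maximum), the zero-fill for missing phonemes, and the function name `phone_duration_functionals` are all assumptions made for illustration, not the paper's implementation.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

N_FUNCTIONALS = 5  # assumed set: mean, median, std, min, max


def phone_duration_functionals(alignment, inventory):
    """Build a fixed-length feature vector of per-phoneme duration statistics.

    alignment: iterable of (phone_label, start_sec, end_sec) tuples
               (hypothetical format for a forced-alignment output).
    inventory: ordered list of phone labels that fixes the feature layout.
    """
    # Collect all observed durations for each phone label.
    durations = defaultdict(list)
    for phone, start, end in alignment:
        durations[phone].append(end - start)

    features = []
    for phone in inventory:
        d = durations.get(phone, [])
        if d:
            features.extend([mean(d), median(d), pstdev(d), min(d), max(d)])
        else:
            # Assumption: phones never observed for this speaker get zeros.
            features.extend([0.0] * N_FUNCTIONALS)
    return features


# Toy alignment for one speaker: two instances of "AA", one of "T", none of "S".
alignment = [("AA", 0.00, 0.12), ("T", 0.12, 0.17), ("AA", 0.17, 0.33)]
features = phone_duration_functionals(alignment, ["AA", "T", "S"])
```

One such vector per speaker, paired with the speaker's age or grade, could then serve as input to any standard regression model; the choice of regressor is left open here because the abstract does not name one.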
