An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings

Spontaneous conversations in real-world settings, such as those found in child-centered recordings, have been shown to be among the most challenging audio to process. Nevertheless, speech processing models that handle such a wide variety of conditions would be particularly useful for language acquisition studies, in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and for measuring the effects of remediation. In this paper, we present our approach to designing an open-source neural network that classifies audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora that total 260 hours of recordings and cover 10 languages. The output of our model can serve as input for downstream tasks such as estimating the number of words produced by adult speakers or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA) system, which has been used in numerous child language studies.
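
To make the downstream use concrete, the following is a minimal sketch of how labeled segments might be aggregated into a total vocalization time per voice type. It is not the released tool's interface: the `(start, end, label)` tuple format, the label names, and the function `duration_per_voice_type` are all assumptions made for illustration.

```python
# Hypothetical label names for the four voice types listed in the abstract;
# the actual classifier may use different identifiers.
VOICE_TYPES = ("key_child", "other_child", "adult_male", "adult_female")


def duration_per_voice_type(segments):
    """Sum segment durations (in seconds) for each voice type.

    `segments` is an iterable of (start, end, label) tuples. This format is
    an assumption for the sketch, not the classifier's documented output.
    Segments of different voice types may overlap in time; each type is
    summed independently.
    """
    totals = {label: 0.0 for label in VOICE_TYPES}
    for start, end, label in segments:
        if label not in totals:
            raise ValueError(f"unknown voice type: {label!r}")
        if end < start:
            raise ValueError(f"segment ends before it starts: {(start, end)}")
        totals[label] += end - start
    return totals


# Made-up segments, purely for illustration.
example_segments = [
    (0.0, 1.5, "adult_female"),
    (1.0, 2.0, "key_child"),
    (3.0, 4.0, "adult_female"),
]
totals = duration_per_voice_type(example_segments)
```

Per-type durations of this kind are only a proxy for measures such as adult word counts, which would require a further estimation step on top of the segments.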
