Developing Neural Representations for Robust Child-Adult Diarization

Automated processing and analysis of child speech has long been acknowledged as a harder problem than processing adult speech. In particular, conversations between a child and an adult involve spontaneous speech, which often compounds the idiosyncrasies of child speech. In this work, we improve speaker diarization (determining who spoke when) from audio of child-adult conversations in naturalistic settings. We select conversations from the autism diagnosis and intervention domains, where speaker diarization is an important step toward computational behavioral analysis in support of clinical research and decision making. Unlike predominant state-of-the-art models, which typically use only adult speech for speaker embedding training, we train deep speaker embeddings on publicly available child speech and adult speech corpora. We demonstrate significant relative reductions in diarization error rate (DER) on DIHARD II (dev) sessions containing child speech (22.88%) and on two internal corpora of interactions involving children with autism: excerpts from ADOS Mod3 sessions (33.7%) and a combination of full-length ADOS and BOSCC sessions (44.99%). Further, we validate our improvements in identifying the child speaker (who typically has short speaking time) using the recall measure. Finally, we analyze the effect of fundamental frequency augmentation and the effects of child age and gender on diarization performance.
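
For reference, DER here follows its standard definition: the fraction of total reference speech time attributed incorrectly. A conventional formulation (the abstract does not state the exact scoring settings, such as collar size or overlap handling) is

\[
\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{conf}}}{T_{\mathrm{total}}},
\]

where \(T_{\mathrm{FA}}\) is false-alarm speech time, \(T_{\mathrm{miss}}\) is missed speech time, \(T_{\mathrm{conf}}\) is speaker-confusion time, and \(T_{\mathrm{total}}\) is total reference speech time. The reported numbers are relative reductions, so the 22.88% figure means \(\mathrm{DER}_{\mathrm{new}} = (1 - 0.2288)\,\mathrm{DER}_{\mathrm{baseline}}\).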
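
The abstract does not specify how the fundamental frequency augmentation is implemented; the sketch below is a minimal, hypothetical version that pitch-shifts training audio with librosa. The ±4 semitone range, the file names, and the augment_f0 helper are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of fundamental-frequency (pitch) augmentation for
# speaker-embedding training data; the actual shift range and resynthesis
# method used in the paper are not stated in the abstract.
import random

import librosa
import soundfile as sf

def augment_f0(in_path: str, out_path: str, max_semitones: float = 4.0) -> None:
    """Pitch-shift a waveform by a random number of semitones and save it."""
    y, sr = librosa.load(in_path, sr=None)  # keep the native sample rate
    n_steps = random.uniform(-max_semitones, max_semitones)  # assumed range
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_path, y_shifted, sr)

if __name__ == "__main__":
    # e.g., generate one F0-augmented copy of a training utterance
    augment_f0("child_utt.wav", "child_utt_f0aug.wav")
```

A shift applied this way changes perceived pitch (and, with this simple phase-vocoder approach, formants as well) without changing duration, which is one plausible way to expose an embedding network to child-like F0 ranges during training.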
