Talker Diarization in the Wild: the Case of Child-centered Daylong Audio-recordings

Speaker diarization (answering 'who spoke when') is a widely researched subject within speech technology. Numerous experiments have been run on datasets built from broadcast news, meeting data, and call centers, and on such data the task sometimes appears close to being solved. Much less work has tackled the hardest diarization task of all: spontaneous conversations in real-world settings. Such diarization would be particularly useful for studies of language acquisition, where researchers investigate the speech children produce and hear in their daily lives. In this paper, we study audio gathered with a recorder worn by small children as they went about their normal days. As a result, each child was exposed to different acoustic environments with a multitude of background noises and a varying number of adults and peers. The inconsistency of speech and noise within and across samples poses a challenge for speaker diarization systems, which we tackled via retraining and data augmentation techniques. We further studied sources of structured variation across raw audio files, including the impact of speaker type distribution, proportion of speech from children, and child age on diarization performance. We discuss the extent to which these findings might generalize to other samples of speech in the wild.
