Mixture of Speaker-type PLDAs for Children's Speech Diarization

In diarization, PLDA is typically used to model an inference structure that assumes the variation across speech segments is induced by different speakers. This speaker variation is then learned from the training data. However, human listeners can also differentiate speakers by age and gender, among other characteristics. In this paper, we investigate a speaker-type informed model that explicitly captures this known variation among speakers. We explore a mixture of three PLDA models, where each model represents one speaker category: adult female, adult male, or child. Each model is weighted by the prior probability of its respective class, and we study the effect of this prior. The evaluation is performed on a subset of the BabyTrain corpus. We first examine the expected performance gain using oracle speaker-type labels, which yields an 11.7% DER reduction. We then introduce a novel baby vocalization augmentation technique and compare the mixture model to the single model. Our experimental results show a 0.9% DER reduction from adding vocalizations. We find empirically that a balanced dataset is important for training the mixture PLDA model, which outperforms the single PLDA by 1.3% using the same training data and achieves a 35.8% DER. The same setup improves over a standard baseline by 2.8% DER.
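To make the prior-weighted mixture concrete, the following is a minimal sketch of one way a pairwise score could be combined across the three speaker-type PLDA models. It is an illustration under stated assumptions, not the scoring rule used in the paper: it assumes each model already provides log-likelihoods for a segment pair under the same-speaker and different-speaker hypotheses, and the function names `log_sum_exp` and `mixture_llr` as well as all numeric values are hypothetical.

```python
import numpy as np


def log_sum_exp(values):
    """Numerically stable log(sum(exp(values)))."""
    values = np.asarray(values, dtype=float)
    peak = np.max(values)
    return peak + np.log(np.sum(np.exp(values - peak)))


def mixture_llr(log_same, log_diff, priors):
    """Prior-weighted mixture log-likelihood ratio for one segment pair.

    log_same[k]: log-likelihood of the pair under the same-speaker
                 hypothesis of speaker-type model k (assumed given).
    log_diff[k]: log-likelihood under the different-speaker hypothesis.
    priors[k]:   prior probability of speaker type k (must be > 0).
    """
    log_priors = np.log(np.asarray(priors, dtype=float))
    numerator = log_sum_exp(log_priors + np.asarray(log_same, dtype=float))
    denominator = log_sum_exp(log_priors + np.asarray(log_diff, dtype=float))
    return numerator - denominator


# Hypothetical per-model log-likelihoods, ordered as
# (adult female, adult male, child), with made-up class priors.
score = mixture_llr(
    log_same=[-10.2, -14.8, -9.1],
    log_diff=[-12.5, -13.9, -13.0],
    priors=[0.4, 0.2, 0.4],
)
print(score)
```

When all three models agree on both log-likelihoods, the priors cancel and the score reduces to the single-model log-likelihood ratio, which is a convenient sanity check for this kind of combination.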
