Children's speaker verification in low- and zero-resource conditions

Abstract

This paper presents our efforts towards developing an automatic speaker verification (ASV) system for child speakers. For most languages, children's speech data for training an ASV system is either unavailable (zero-resource) or very limited (low-resource), which makes developing such a system very challenging. To address this issue, we study the effectiveness of in-domain and out-of-domain data augmentation. For in-domain augmentation, speed and pitch modifications of children's speech are employed to create data synthetically. For out-of-domain augmentation, a limited amount of adults' speech is used. Using adults' speech directly introduces severe acoustic mismatch, since the attributes of adult and child speech differ considerably. To address this drawback, the adult speech is subjected to voice conversion (VC) to alter its acoustic attributes; a cycle-consistent generative adversarial network is used for this purpose. Voice conversion renders adults' speech perceptually similar to children's speech, so the converted adult data can be used for augmentation with minimal acoustic mismatch. The effectiveness of the proposed data augmentation techniques is studied experimentally using an x-vector-based ASV architecture, and the role of i-vectors is also examined. Data augmentation significantly reduces both the equal error rate and the minimum decision cost function in low- and zero-resource conditions, while employing i-vectors to model speaker characteristics is found to be superior. Finally, we also present a detailed study of how data augmentation behaves across variations in the child speakers' age.
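
As a rough illustration of the in-domain augmentation outlined above, the sketch below generates speed- and pitch-modified copies of a single child utterance. It uses librosa purely for illustration (the paper's own tooling may differ), and the perturbation factors, semitone shifts, and file names are assumptions rather than values taken from the paper.

```python
# Minimal sketch of in-domain augmentation via speed and pitch modification.
# Assumed, illustrative values: perturbation factors, semitone shifts, paths.
import librosa
import soundfile as sf

SPEED_FACTORS = [0.9, 1.1]   # assumed speed-perturbation factors
PITCH_STEPS = [-2, 2]        # assumed pitch shifts in semitones


def augment_utterance(in_path, out_prefix):
    """Write speed- and pitch-modified copies of one child utterance."""
    y, sr = librosa.load(in_path, sr=None)

    # Speed modification (resampling-based, as in typical speed
    # perturbation): duration changes and the pitch shifts accordingly.
    for factor in SPEED_FACTORS:
        y_speed = librosa.resample(y, orig_sr=sr, target_sr=int(sr / factor))
        sf.write(f"{out_prefix}_speed{factor}.wav", y_speed, sr)

    # Pitch modification: shift the pitch by a few semitones while
    # leaving the utterance duration unchanged.
    for steps in PITCH_STEPS:
        y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        sf.write(f"{out_prefix}_pitch{steps}.wav", y_pitch, sr)


if __name__ == "__main__":
    augment_utterance("child_utt.wav", "child_utt_aug")
```

Writing each modified signal as a separate file keeps the augmented utterances usable as additional training examples alongside the original recordings.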
