A Study of Child Speech Extraction Using Joint Speech Enhancement and Separation in Realistic Conditions

In this paper, we design a novel joint framework of speech enhancement and speech separation for child speech extraction in realistic conditions, targeting the problem of extracting child speech from daily conversations in BabyTrain mega corpus. To the best of our knowledge, it is the first discussion of a feasible method for child speech extraction in realistic conditions. First, we make detailed analysis of the BabyTrain mega corpus, which is recorded in adverse environments. We observe problems of background noises, reverberations and child speech that is partially obscured by adult speech (for instance due to speaker overlap but also imitation by the adult). Motivated by this, we conduct a joint framework of speech enhancement and speech separation for child speech extraction. To measure the extraction results in realistic conditions, we propose several objective measurements to evaluate the performance of the our system, which is different from those commonly used for simulation data. Compared with the unprocessed approach and classification approach, our proposed approach can yield the best performance among all subsets of BabyTrain.

[1]  Jill Gilkerson,et al.  Transcriptional Analyses of the LENA Natural Language Corpus , 2009 .

[2]  Jun Du,et al.  SNR-Based Progressive Learning of Deep Neural Network for Speech Enhancement , 2016, INTERSPEECH.

[3]  Treebank Penn,et al.  Linguistic Data Consortium , 1999 .

[4]  Kenneth Ward Church,et al.  The Second DIHARD Diarization Challenge: Dataset, task, and baselines , 2019, INTERSPEECH.

[5]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Methods for objective and subjective assessment of quality Perceptual evaluation of speech quality ( PESQ ) : An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs , 2002 .

[7]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  Anne S Warlaumont,et al.  Infant-adult vocal interaction dynamics depend on infant vocal type, child-directedness of adult speech, and timeframe. , 2019, Infant behavior & development.

[9]  Alejandrina Cristia,et al.  A step-by-step guide to collecting and analyzing long-format speech environment (LFSE) recordings , 2019, Collabra: Psychology.

[10]  Geoffrey Zweig,et al.  An introduction to computational networks and the computational network toolkit (invited talk) , 2014, INTERSPEECH.

[11]  Hao Zheng,et al.  AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline , 2017, 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA).

[12]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  F. Tion,et al.  Reliability of the LENA TM Language Environment Analysis System in Young Children's Natural Home Environment , 2009 .

[14]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Ronald Rousseau,et al.  Similarity measures in scientometric research: The Jaccard index versus Salton's cosine formula , 1989, Inf. Process. Manag..

[16]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[17]  DeLiang Wang,et al.  Ideal ratio mask estimation using deep neural networks for robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Björn W. Schuller,et al.  Discriminatively trained recurrent neural networks for single-channel speech separation , 2014, 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[19]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[20]  Jun Du,et al.  A Progressive Deep Learning Approach to Child Speech Separation , 2018, 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP).

[21]  Alejandrina Cristià,et al.  Talker Diarization in the Wild: the Case of Child-centered Daylong Audio-recordings , 2018, INTERSPEECH.

[22]  Jun Du,et al.  Speech separation based on improved deep neural networks with dual outputs of speech features for both target and interfering speakers , 2014, The 9th International Symposium on Chinese Spoken Language Processing.