Speaker Verification Using Short Utterances with DNN-Based Estimation of Subglottal Acoustic Features

Speaker verification in real-world applications must often deal with limited amounts of enrollment and/or test data. MFCC-based i-vector systems have defined the state of the art for speaker verification, but it is well known that they are less effective with short utterances. To address this issue, we propose a method that leverages the speaker specificity and stationarity of subglottal acoustics. First, we present a deep neural network (DNN)-based approach to estimate subglottal features from speech signals. The approach involves training a DNN regression model that maps the log filter-bank coefficients of a given speech signal to those of its corresponding subglottal signal. Cross-validation experiments on the WashU-UCLA corpus (which contains parallel recordings of speech and subglottal acoustics) show the effectiveness of our DNN-based estimation algorithm: the average correlation coefficient between the actual and estimated subglottal filter-bank coefficients is 0.9. A score-level fusion of the MFCC and subglottal-feature systems in the i-vector PLDA framework yields statistically significant improvements over the MFCC-only baseline. On the NIST SRE 2008 truncated 10sec-10sec and 5sec-5sec core evaluation tasks, the relative reduction in equal error rate ranges from 6% to 14% for the conditions tested with both microphone and telephone speech.
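
As a rough illustration of the feature-mapping idea (not the paper's exact configuration), the sketch below shows a feed-forward DNN regression that maps per-frame speech log filter-bank coefficients, with spliced context, to the corresponding subglottal log filter-bank coefficients. The number of filter-bank channels, context width, layer sizes, activations, and optimizer settings here are assumptions chosen for illustration.

```python
# Minimal sketch (assumed hyperparameters, not the authors' implementation):
# a DNN regression from speech log filter-bank frames to subglottal
# log filter-bank frames, trained with a mean-squared-error objective.
import torch
import torch.nn as nn

N_FBANK = 40   # assumed number of log filter-bank coefficients per frame
CONTEXT = 5    # assumed +/- context frames spliced around the center frame
IN_DIM = N_FBANK * (2 * CONTEXT + 1)

class SubglottalMapper(nn.Module):
    def __init__(self, in_dim=IN_DIM, hidden=1024, out_dim=N_FBANK):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, out_dim),  # linear output layer for regression
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, speech_frames, subglottal_frames):
    """One MSE training step on a batch of parallel speech/subglottal frames."""
    optimizer.zero_grad()
    pred = model(speech_frames)
    loss = nn.functional.mse_loss(pred, subglottal_frames)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = SubglottalMapper()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Dummy parallel batch standing in for WashU-UCLA speech/subglottal frames.
    x = torch.randn(32, IN_DIM)
    y = torch.randn(32, N_FBANK)
    print(train_step(model, opt, x, y))
```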

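The two evaluation ideas mentioned above can be sketched as follows: a correlation measure between actual and estimated subglottal filter-bank coefficients, and a weighted score-level fusion of the MFCC and subglottal i-vector/PLDA subsystem scores. The fusion weight below is an assumed placeholder, not a value from the paper; in practice it would be tuned on held-out data.

```python
# Sketch only; the fusion weight is an assumption, not the paper's value.
import numpy as np

def fbank_correlation(actual, estimated):
    """Pearson correlation between actual and estimated log filter-bank
    coefficients, flattened over all frames and channels."""
    a = np.asarray(actual).ravel()
    e = np.asarray(estimated).ravel()
    return np.corrcoef(a, e)[0, 1]

def fuse_scores(mfcc_scores, subglottal_scores, alpha=0.7):
    """Weighted score-level fusion of the two subsystems' trial scores."""
    return alpha * np.asarray(mfcc_scores) + (1.0 - alpha) * np.asarray(subglottal_scores)
```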