Speaker Verification Using Short Utterances with DNN-Based Estimation of Subglottal Acoustic Features

Speaker verification in real-world applications must often deal with limited amounts of enrollment and/or test data. MFCC-based i-vector systems have defined the state of the art for speaker verification, but it is well known that they are less effective with short utterances. To address this issue, we propose a method that leverages the speaker specificity and stationarity of subglottal acoustics. First, we present a deep neural network (DNN)-based approach to estimate subglottal features from speech signals. The approach involves training a DNN regression model that maps the log filter-bank coefficients of a given speech signal to those of its corresponding subglottal signal. Cross-validation experiments on the WashU-UCLA corpus (which contains parallel recordings of speech and subglottal acoustics) show the effectiveness of our DNN-based estimation algorithm: the average correlation coefficient between the actual and estimated subglottal filter-bank coefficients is 0.9. A score-level fusion of the MFCC and subglottal-feature systems in the i-vector PLDA framework yields statistically significant improvements over the MFCC-only baseline. On the NIST SRE 2008 truncated 10sec-10sec and 5sec-5sec core evaluation tasks, the relative reduction in equal error rate ranges from 6% to 14% for the conditions tested with both microphone and telephone speech.
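
As a rough illustration of the feature-mapping idea (not the paper's exact configuration), the sketch below shows a feed-forward DNN regression that maps per-frame speech log filter-bank coefficients, with spliced context, to the corresponding subglottal log filter-bank coefficients. The number of filter-bank channels, context width, layer sizes, activations, and optimizer settings here are assumptions chosen for illustration.

```python
# Minimal sketch (assumed hyperparameters, not the authors' implementation):
# a DNN regression from speech log filter-bank frames to subglottal
# log filter-bank frames, trained with a mean-squared-error objective.
import torch
import torch.nn as nn

N_FBANK = 40   # assumed number of log filter-bank coefficients per frame
CONTEXT = 5    # assumed +/- context frames spliced around the center frame
IN_DIM = N_FBANK * (2 * CONTEXT + 1)

class SubglottalMapper(nn.Module):
    def __init__(self, in_dim=IN_DIM, hidden=1024, out_dim=N_FBANK):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, out_dim),  # linear output layer for regression
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, speech_frames, subglottal_frames):
    """One MSE training step on a batch of parallel speech/subglottal frames."""
    optimizer.zero_grad()
    pred = model(speech_frames)
    loss = nn.functional.mse_loss(pred, subglottal_frames)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = SubglottalMapper()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Dummy parallel batch standing in for WashU-UCLA speech/subglottal frames.
    x = torch.randn(32, IN_DIM)
    y = torch.randn(32, N_FBANK)
    print(train_step(model, opt, x, y))
```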

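The two evaluation ideas mentioned above can be sketched as follows: a correlation measure between actual and estimated subglottal filter-bank coefficients, and a weighted score-level fusion of the MFCC and subglottal i-vector/PLDA subsystem scores. The fusion weight below is an assumed placeholder, not a value from the paper; in practice it would be tuned on held-out data.

```python
# Sketch only; the fusion weight is an assumption, not the paper's value.
import numpy as np

def fbank_correlation(actual, estimated):
    """Pearson correlation between actual and estimated log filter-bank
    coefficients, flattened over all frames and channels."""
    a = np.asarray(actual).ravel()
    e = np.asarray(estimated).ravel()
    return np.corrcoef(a, e)[0, 1]

def fuse_scores(mfcc_scores, subglottal_scores, alpha=0.7):
    """Weighted score-level fusion of the two subsystems' trial scores."""
    return alpha * np.asarray(mfcc_scores) + (1.0 - alpha) * np.asarray(subglottal_scores)
```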