Speaker recognition via fusion of subglottal features and MFCCs

Motivated by the speaker-specificity and stationarity of subglottal acoustics, this paper investigates the utility of subglottal cepstral coefficients (SGCCs) for speaker identification (SID) and verification (SV). SGCCs can be computed from accelerometer recordings of subglottal acoustics, but collecting such recordings is infeasible in real-world scenarios. To estimate SGCCs from speech signals instead, we adopt the Bayesian minimum mean squared error (MMSE) estimator proposed in the speech-to-articulatory inversion literature. The joint distribution of SGCCs and speech MFCCs is modeled using the WashU-UCLA corpus (which contains simultaneous recordings of speech and subglottal acoustics), and the resulting model is used to obtain an MMSE estimate of SGCCs from unseen (test) MFCCs. Cross-validation experiments on the WashU-UCLA corpus show that the estimation efficacy, on average, is speaker dependent. A score-level fusion of the MFCC and SGCC systems outperforms the MFCC-only baseline in both SID and SV tasks. On the TIMIT database (SID), the relative reduction in identification error is 16%, 40%, and 51% for G.712-filtered (300–3400 Hz), narrowband (0–4000 Hz), and wideband (0–8000 Hz) speech, respectively. On the NIST 2008 database (SV), the relative reduction in equal error rate is 4% for 10-second utterances and 11% for 5-second utterances.
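
The estimation step can be illustrated with a short sketch. It assumes that the joint distribution of MFCCs and SGCCs is a Gaussian mixture model with full covariances, a common choice in the inversion literature; the abstract itself does not state the model form. Under that assumption, the MMSE estimate of the SGCCs is the posterior-weighted sum of the per-component conditional means. The function name `gmm_mmse_estimate` and its parameter layout are hypothetical and chosen only for illustration.

```python
import numpy as np


def gmm_mmse_estimate(x, weights, means, covs, dx):
    """Return E[y | x] under a joint GMM over z = [x; y].

    Assumed (hypothetical) layout:
      x       : (n_frames, dx) observed features, e.g. MFCCs
      weights : (K,) mixture weights
      means   : (K, dx + dy) joint means, observed part first
      covs    : (K, dx + dy, dx + dy) full joint covariances
      dx      : dimensionality of the observed part
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    n_frames, n_comp = x.shape[0], len(weights)
    dy = means.shape[1] - dx

    log_post = np.empty((n_frames, n_comp))
    cond_means = np.empty((n_comp, n_frames, dy))

    for k in range(n_comp):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        s_xx = covs[k, :dx, :dx]   # covariance of the observed part
        s_yx = covs[k, dx:, :dx]   # cross-covariance (target, observed)

        diff = x - mu_x                       # (n_frames, dx)
        sol = np.linalg.solve(s_xx, diff.T)   # s_xx^{-1} diff^T, (dx, n_frames)

        # Log of weight_k * N(x; mu_x, s_xx), used for the posteriors.
        _, logdet = np.linalg.slogdet(s_xx)
        maha = np.sum(diff.T * sol, axis=0)
        log_post[:, k] = np.log(weights[k]) - 0.5 * (
            maha + logdet + dx * np.log(2.0 * np.pi)
        )

        # Conditional mean of the target given x for component k.
        cond_means[k] = mu_y + (s_yx @ sol).T

    # Normalize posteriors in a numerically stable way.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Posterior-weighted sum of conditional means: (n_frames, dy).
    return np.einsum("nk,knd->nd", post, cond_means)
```

In this sketch the mixture parameters would be fitted on paired MFCC and SGCC frames (for instance with an EM-based GMM trainer) and then applied frame by frame to test MFCCs; how the paper configures or trains its model is not specified here.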
