Speaker verification based on fusion of acoustic and articulatory information

We propose a practical, feature-level fusion approach for combining acoustic and articulatory information in the speaker verification task. We find that concatenating articulatory features obtained from measured speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves overall speaker verification performance. However, since access to measured articulatory data is impractical for real-world speaker verification applications, we also experiment with estimated articulatory features obtained using an acoustic-to-articulatory inversion technique. Specifically, we show that augmenting MFCCs with articulatory features obtained from a subject-independent acoustic-to-articulatory inversion technique also significantly enhances speaker verification performance. This performance boost could be due to information about inter-speaker variation present in the estimated articulatory features, especially at the mean and variance level. Experimental results on the Wisconsin X-ray Microbeam database show that the proposed acoustic-estimated-articulatory fusion approach significantly outperforms the traditional acoustic-only baseline, providing up to a 10% relative reduction in Equal Error Rate (EER). We further show that we can achieve an additional 5% relative reduction in EER after score-level fusion.

Index Terms: speech production, speaker verification, articulatory features, acoustic-to-articulatory inversion, biometrics
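As a rough illustration of the two fusion strategies mentioned above (not the authors' actual pipeline), the following Python sketch shows feature-level fusion as frame-wise concatenation of MFCCs with time-aligned (estimated) articulatory trajectories, and score-level fusion as a weighted combination of per-trial scores from two systems, scored with a simple EER routine. All function names, feature dimensions, the fusion weight alpha, and the synthetic trial data are illustrative assumptions.

```python
import numpy as np


def feature_level_fusion(mfcc, artic):
    """Frame-wise concatenation of acoustic and articulatory feature streams.

    mfcc  : (T, D_a) array of MFCC frames
    artic : (T, D_p) array of (estimated) articulatory frames, time-aligned to mfcc
    Returns a (T, D_a + D_p) fused feature matrix.
    """
    assert mfcc.shape[0] == artic.shape[0], "streams must be time-aligned"
    return np.hstack([mfcc, artic])


def score_level_fusion(scores_a, scores_b, alpha=0.5):
    """Weighted linear combination of two systems' verification trial scores.

    alpha is an illustrative weight; in practice it would be tuned on a
    held-out development set.
    """
    return alpha * scores_a + (1.0 - alpha) * scores_b


def equal_error_rate(scores, labels):
    """EER: the operating point where false-accept and false-reject rates meet.

    scores : higher means more target-like (same-speaker) trial
    labels : 1 for target trials, 0 for impostor trials
    """
    labels = np.asarray(labels, dtype=bool)
    eer, gap = 1.0, np.inf
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[~labels])   # false accepts among impostor trials
        frr = np.mean(~accept[labels])   # false rejects among target trials
        if abs(far - frr) < gap:
            gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy feature-level fusion: 200 aligned frames of 13-dim MFCCs and
    # 8 estimated articulatory trajectories (dimensions are illustrative).
    mfcc = rng.standard_normal((200, 13))
    artic = rng.standard_normal((200, 8))
    print(feature_level_fusion(mfcc, artic).shape)   # -> (200, 21)

    # Toy score-level fusion and EER on 1000 synthetic verification trials.
    labels = rng.integers(0, 2, size=1000)
    s_acoustic = rng.standard_normal(1000) + 1.5 * labels
    s_fused = rng.standard_normal(1000) + 1.7 * labels
    print(equal_error_rate(score_level_fusion(s_acoustic, s_fused, alpha=0.4), labels))
```

In a real system the per-frame fused features would feed a UBM/i-vector or similar back end, and the score-level fusion weight would be calibrated on development trials rather than fixed by hand.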
