Significance of Vowel-Like Regions for Speaker Verification Under Degraded Conditions

Vowel-like regions (VLRs) in speech includes vowels, semi-vowels, and diphthong sound units. VLR can be identified using a vowel-like region onset point (VLROP) event. By production, the VLR has impulse-like excitation and therefore information about the vocal tract system may be better manifested in them. Also, the VLR is a relatively high signal-to-noise ratio (SNR) region. Speaker information extracted from such a region may therefore be more speaker discriminative and relatively less affected by the degradations like noise, reverberation, and sensor mismatches. Due to this, better speaker modeling and reliable testing may be possible. In this paper, VLRs are detected using the knowledge of VLROPs during training and testing. Features from the VLRs are then used for training and testing the speaker models. As a result, significant improvement in the performance is reported for speaker verification under degraded conditions.

[1]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[2]  Douglas A. Reynolds,et al.  Speaker identification and verification using Gaussian mixture speaker models , 1995, Speech Commun..

[3]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[4]  S. R. Mahadeva Prasanna,et al.  Detection of vowel onset point events using excitation information , 2005, INTERSPEECH.

[5]  Javier Ortega-Garcia,et al.  Increasing robustness in GMM speaker recognition systems for noisy and reverberant speech with low complexity microphone arrays , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[6]  D. J. Hermes,et al.  Vowel-onset detection. , 1990, The Journal of the Acoustical Society of America.

[7]  B. Yegnanarayana,et al.  Neural Networks based Approach for Detection of Vowel Onset Points , 1999 .

[8]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[9]  Carla Teixeira Lopes,et al.  TIMIT Acoustic-Phonetic Continuous Speech Corpus , 2012 .

[10]  Mark J. F. Gales,et al.  Variance compensation within the MLLR framework for robust speech recognition and speaker adaptation , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[11]  Jirí Navrátil,et al.  Statistical model migration in speaker recognition , 2004, INTERSPEECH.

[12]  B. Yegnanarayana,et al.  Vowel onset point based variable frame rate analysis for speech recognition , 2005, Proceedings of 2005 International Conference on Intelligent Sensing and Information Processing, 2005..

[13]  Larry P. Heck,et al.  A model-based transformational approach to robust speaker recognition , 2000, INTERSPEECH.

[14]  Bayya Yegnanarayana,et al.  Enhancement of reverberant speech using LP residual signal , 2000, IEEE Trans. Speech Audio Process..

[15]  Roland Auckenthaler,et al.  Score Normalization for Text-Independent Speaker Verification Systems , 2000, Digit. Signal Process..

[16]  Alvin F. Martin The NIST speaker recognition evaluations , 2012, Odyssey.

[17]  Hynek Hermansky,et al.  Speech enhancement using linear prediction residual , 1999, Speech Commun..

[18]  S. R. Mahadeva Prasanna,et al.  Extraction of pitch in adverse conditions , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  S. R. Mahadeva Prasanna,et al.  Enhancement of noisy speech by temporal and spectral processing , 2011, Speech Commun..

[20]  Bayya Yegnanarayana,et al.  Epoch Extraction From Speech Signals , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[21]  Yuan Dong,et al.  Svm-Based Speaker Verification by Location in the Space of Reference Speakers , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[22]  Yonghong Yan,et al.  Maximum a posteriori linear regression for speaker recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[23]  Coarticulation • Suprasegmentals,et al.  Acoustic Phonetics , 2019, The SAGE Encyclopedia of Human Communication Sciences and Disorders.

[24]  David A. van Leeuwen,et al.  NIST and NFI-TNO evaluations of automatic speaker recognition , 2006, Comput. Speech Lang..

[25]  J. Makhoul,et al.  Linear prediction: A tutorial review , 1975, Proceedings of the IEEE.

[26]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[27]  Mike Brookes,et al.  Estimation of Glottal Closure Instants in Voiced Speech Using the DYPSA Algorithm , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[28]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[29]  P. Krishnamoorthy,et al.  Reverberant Speech Enhancement by Temporal and Spectral Processing , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[30]  S. R. Mahadeva Prasanna,et al.  Vowel Onset Point Detection Using Source, Spectral Peaks, and Modulation Spectrum Energies , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[31]  S R M Prasanna,et al.  Multi-variability speech database for robust speaker recognition , 2011, 2011 National Conference on Communications (NCC).

[32]  Aaron E. Rosenberg,et al.  On the use of instantaneous and transitional spectral information in speaker recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[33]  James R. Glass,et al.  Robust Speaker Recognition in Noisy Conditions , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[34]  Martin J. Russell,et al.  Text-dependent speaker verification under noisy conditions using parallel model combination , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[35]  Bayya Yegnanarayana,et al.  Characterization of Glottal Activity From Speech Signals , 2009, IEEE Signal Processing Letters.

[36]  Hong Kook Kim,et al.  Cepstrum-domain acoustic feature compensation based on decomposition of speech and noise for ASR in noisy environments , 2001, IEEE Trans. Speech Audio Process..

[37]  Hynek Hermansky,et al.  RASTA-PLP speech analysis technique , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[38]  Mohamed Kamal Omar,et al.  Maximum conditional mutual information modeling for speaker verification , 2005, INTERSPEECH.

[39]  Haizhou Li,et al.  An overview of text-independent speaker recognition: From features to supervectors , 2010, Speech Commun..

[40]  Chung-Hsien Wu,et al.  A hierarchical neural network model based on a C/V segmentation algorithm for isolated Mandarin speech recognition , 1991, IEEE Trans. Signal Process..