Formant-gaps Features for Speaker Verification Using Whispered Speech

In this work, we propose a new formant-based feature for the whispered speaker verification (SV) task, where neutral speech is used for enrollment and whispered recordings are used for testing. Such a mismatch between enrollment and test conditions often degrades the performance of whispered SV systems, owing to the differences in acoustic characteristics between whispered and neutral speech. We hypothesize that the proposed formant and formant-gap (FoG) features capture speaker-specific information in a way that is more invariant to the mode of speech than traditional baseline features for SV, including mel-frequency cepstral coefficients (MFCC) and auditory-inspired amplitude modulation features (AAMF). Whispered SV experiments with 714 speakers, comprising 29232 neutral and 22932 whispered recordings, reveal that the equal error rate (EER) using the proposed features is lower than that using the best baseline features by ~3.79% (absolute). It was also observed that at least four whispered recordings are required during enrollment for the baseline features to perform on par with the proposed features. However, for the neutral SV task, the best performing baseline features yield an EER that is ~1.88% lower than that using the proposed features.
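
To make the feature concrete, below is a minimal sketch of formant and formant-gap (FoG) extraction via LPC root-finding. The LPC order, frame length, and number of formants are illustrative assumptions, and `formants_and_gaps` is a hypothetical helper, not the paper's exact pipeline.

```python
# Minimal sketch: estimate formants of one windowed frame from the angles of
# LPC poles, then take differences of adjacent formants as the formant gaps.
import numpy as np
import librosa

def formants_and_gaps(frame, sr, lpc_order=16, n_formants=4):
    """Estimate the first few formants of one windowed frame and their gaps."""
    a = librosa.lpc(frame.astype(float), order=lpc_order)  # LPC coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                   # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2.0 * np.pi)        # pole angles -> frequencies (Hz)
    freqs = np.sort(freqs[freqs > 90.0])[:n_formants]   # drop near-DC poles, keep F1..Fn
    return freqs, np.diff(freqs)                        # formants and gaps (F2-F1, F3-F2, ...)

# Usage on a 25 ms Hamming-windowed frame of 16 kHz speech (y is any loaded signal):
# formants, gaps = formants_and_gaps(y[:400] * np.hamming(400), sr=16000)
```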

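The EER figures above refer to the operating point at which the false-acceptance and false-rejection rates are equal. A minimal sketch of how EER is commonly computed from verification trial scores follows; the Gaussian scores in the usage comment are hypothetical, not the paper's data.

```python
# Minimal sketch: sweep thresholds over the pooled scores and report the error
# rate where false-acceptance and false-rejection rates are closest.
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Return the error rate at the threshold where FAR and FRR meet."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false acceptance
    frr = np.array([(target_scores < t).mean() for t in thresholds])     # false rejection
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])

# e.g., equal_error_rate(np.random.normal(2, 1, 1000), np.random.normal(0, 1, 1000))
```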