Improving speaker verification performance against long-term speaker variability

A specially collected speech database that reflects long-term speaker variability. F-ratio to determine the importance of speaker-specific and session-specific information. Frequency warping and filter-bank output weighting strategies for feature extraction.

Speaker verification performance degrades when the input speech is tested in different sessions spread over a long period of time. Common ways to alleviate this long-term degradation are enrollment data augmentation, speaker model adaptation, and adapted verification thresholds. From the feature point of view of a pattern recognition system, robust features that are speaker-specific and invariant to time and acoustic environment are preferred for dealing with this long-term variability. In this paper, using a newly created speech database, CSLT-Chronos, specially collected to reflect long-term speaker variability, we investigate the issue in the frequency domain by emphasizing higher discrimination for speaker-specific information and lower sensitivity to time-related, session-specific information. The F-ratio is employed as the figure of merit for judging these two kinds of information and for finding a compromise between them. Inspired by the feature extraction procedure of traditional MFCC calculation, two emphasis strategies are explored for generating modified acoustic features for speaker verification: pre-filtering frequency warping and post-filtering filter-bank output weighting. Experiments show that the two proposed features outperform traditional MFCC on CSLT-Chronos. The proposed approach is also studied on the NIST SRE 2008 database in a state-of-the-art, i-vector based architecture. Experimental results demonstrate the advantage of the proposed features over MFCC in LDA and PLDA based i-vector systems.
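The role of the F-ratio as a per-band figure of merit can be illustrated with a minimal Python sketch. It uses the common textbook definition (variance of the group means divided by the mean within-group variance) and may differ in detail from the paper's formulation. The function names `f_ratio` and `band_weights`, and the rule of dividing the speaker F-ratio by the session F-ratio to form a weight, are assumptions made for illustration; the abstract does not specify how the compromise is computed.

```python
import numpy as np

def f_ratio(band_energies, labels):
    """Per-band F-ratio: variance of group means / mean within-group variance.

    band_energies: array of shape (n_frames, n_bands), e.g. log filter-bank outputs.
    labels: array of shape (n_frames,), one group id (speaker or session) per frame.
    """
    band_energies = np.asarray(band_energies, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    # Mean and (population) variance of every band inside each group.
    means = np.stack([band_energies[labels == g].mean(axis=0) for g in groups])
    within = np.stack([band_energies[labels == g].var(axis=0) for g in groups])
    return means.var(axis=0) / within.mean(axis=0)

def band_weights(f_speaker, f_session, eps=1e-8):
    """Hypothetical compromise (an assumption, not the paper's rule):
    favor bands with a high speaker F-ratio and a low session F-ratio,
    then scale the weights so that their mean is 1."""
    ratio = np.asarray(f_speaker, dtype=float) / (np.asarray(f_session, dtype=float) + eps)
    return ratio / ratio.mean()

if __name__ == "__main__":
    # Toy data: 2 speakers, 2 frames each, 2 frequency bands.
    frames = np.array([[0.0, 4.0], [2.0, 6.0], [10.0, 4.0], [12.0, 6.0]])
    speakers = np.array(["A", "A", "B", "B"])
    print(f_ratio(frames, speakers))  # band 0 separates the speakers, band 1 does not
```

Calling `f_ratio` once with speaker labels and once with session labels yields the two curves that such a weighting would trade off; weights of this kind could then multiply the filter-bank outputs before the cepstral transform.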
