Constrained temporal structure for text-dependent speaker verification

On mobile devices, speaker recognition engines are subject to ergonomic constraints and a limited amount of computing resources. Although GMM/UBM systems are effective in classical contexts, they show their limitations when the quantity of speech data is restricted. In contrast, the proposed GMM/UBM extension addresses situations characterised by limited enrolment data and only the computing power typically found on modern mobile devices. A key contribution is the harnessing of the temporal structure of speech through client-customised pass-phrases and new Markov model structures. This additional temporal information is then used with Viterbi decoding to enhance discrimination, increasing the gap between client and imposter scores. Experiments on the MyIdea database are presented, with a standard GMM/UBM configuration acting as the baseline. When imposters do not know the client pass-phrase, a relative gain of up to 65% in terms of equal error rate (EER) is achieved over the GMM/UBM baseline configuration. The results highlight the potential of this new approach, which offers a good balance between complexity and recognition accuracy.
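
As a reading aid for the EER figure quoted above, the following is a minimal sketch of how an equal error rate and a relative gain over a baseline might be computed from verification scores. The function names, the threshold sweep, and all score values are illustrative assumptions; they are not taken from the paper's evaluation protocol or its results.

```python
def equal_error_rate(client_scores, imposter_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(client_scores) | set(imposter_scores))
    best = None
    for t in thresholds:
        # False rejection rate: client trials scoring below the threshold.
        frr = sum(s < t for s in client_scores) / len(client_scores)
        # False acceptance rate: imposter trials scoring at or above it.
        far = sum(s >= t for s in imposter_scores) / len(imposter_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Keep the operating point where the two error rates are closest.
            best = (gap, (far + frr) / 2.0)
    return best[1]


def relative_gain(baseline_eer, new_eer):
    """Relative EER reduction with respect to a baseline system."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical scores, chosen only to exercise the functions.
client = [2.0, 3.0, 4.0, 5.0]
imposter = [0.0, 1.0, 2.5, 1.5]
eer = equal_error_rate(client, imposter)

# Hypothetical EERs: a drop from 10% to 3.5% is a 65% relative gain.
gain = relative_gain(0.10, 0.035)
```

A larger gap between client and imposter score distributions moves the crossing point of the two error curves towards zero, which is why widening that gap translates into a lower EER.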
