Speaker and Language Recognition Techniques

In this work we give an overview of state-of-the-art speaker and language recognition systems. We analyze techniques for extracting and modeling features from the acoustic signal and for modeling speech content by means of phonetic decoding. We then present state-of-the-art generative systems based on latent variable models and discriminative techniques based on Support Vector Machines (SVMs). We also present the author's contributions to the field, which span the topics covered in this work. First, we propose an improvement to Neural Network training for speech decoding based on the General-Purpose Graphics Processing Unit (GPGPU) computational framework. We also propose adaptations of latent variable models, originally developed for speaker recognition, to language identification. A novel technique that enhances the generation of low-dimensional utterance representations for speaker verification is also presented. Finally, we give a detailed analysis of different training algorithms for SVM-based speaker verification and propose a novel discriminative framework for speaker verification, the Pairwise SVM approach, which allows for fast utterance testing and achieves very good recognition performance.
