The use of discrete distributions with a very large codebook for automatic speech recognition and speaker verification

With the advance of semiconductor technology and the growing popularity of the distributed speech/speaker recognition paradigm (e.g., Siri on the iPhone 4S), we revisit the use of the discrete model in automatic speech recognition (ASR) and speaker verification (SV) tasks. Compared with the dominant continuous density model, the discrete model has inherently attractive properties: it uses non-parametric output distributions, and looking up a probability value takes only O(1) time. Furthermore, the features used in the discrete model can be encoded in fewer bits than those of the continuous model, which lowers the bandwidth requirement in a distributed speech/speaker recognition architecture. Unfortunately, the recognition performance of a conventional discrete model is significantly worse than that of a continuous one, owing to the large quantization error and the use of multiple independent streams. In this thesis, we propose to reduce the quantization error of a discrete model by using a very large codebook with tens of thousands of codewords. The codebook of the proposed model is about a hundred times larger than that of a conventional discrete model, whose codebook size usually ranges from 256 to 1024. Accordingly, the number of parameters needed to specify a discrete output distribution grows a hundredfold in the proposed model. Compared with a discrete model with a conventionally sized codebook, a very large codebook model poses two major challenges. First, given a continuous acoustic feature vector, how do we quickly find its corresponding codeword in a codebook that is a hundred times larger? Second, given the limited amount of training data, how can we robustly train such a high-density model, which has a hundred times more parameters than the conventional model? To find the codeword for an acoustic vector quickly, we employ subvector-quantized (SVQ) codebooks.
SVQ codebooks represent a very large codebook in the full feature space as a combinatorial product of smaller per-subvector codebooks. Finding a full-space codeword is thus reduced to finding a set of SVQ codewords, one per subvector, which is very fast. To robustly train such a high-density model, two techniques are explored. The first is model conversion: a discrete model is converted directly from a well-trained continuous model, which avoids training the discrete model directly on the training data. The second is subspace modeling: the original high-density discrete distribution table is treated as a high-dimensional vector and assumed to lie in a low-dimensional subspace. With this subspace representation, the number of free parameters in the model is reduced by a factor of ten to a hundred. As a result, the model can be trained robustly with the limited amount of data. Experimental evaluations on both ASR and SV tasks show the feasibility and benefits of the very large codebook discrete model. On the WSJ0 (Wall Street Journal) ASR task, the proposed model achieves recognition accuracy comparable to that of the continuous model, with much faster decoding and a lower bandwidth requirement. On the NIST (National Institute of Standards and Technology) 2002 SV task, a speedup of 8- to 25-fold is achieved with almost no loss in verification performance.
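
The SVQ lookup described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`nearest`, `svq_encode`, `full_index`), the toy codebooks, the subvector boundaries, and the use of squared Euclidean distance are all assumptions made for illustration. The point it shows is that two small codebooks of 3 codewords each address 3 × 3 = 9 full-space codewords, while the search only compares against 3 + 3 = 6 codewords.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, codeword in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(codeword, x))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def svq_encode(
    x: Vector,
    codebooks: Sequence[Sequence[Vector]],
    bounds: Sequence[Tuple[int, int]],
) -> Tuple[int, ...]:
    """Quantize each subvector x[lo:hi] with its own small codebook."""
    return tuple(nearest(cb, x[lo:hi]) for cb, (lo, hi) in zip(codebooks, bounds))


def full_index(indices: Sequence[int], sizes: Sequence[int]) -> int:
    """Combine per-subvector indices into one full-space index (mixed radix)."""
    index = 0
    for i, size in zip(indices, sizes):
        index = index * size + i
    return index


# Hypothetical toy setup: a 4-dimensional feature split into two 2-dimensional
# subvectors, each with a 3-codeword codebook.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(0.0, 1.0), (1.0, 0.0), (3.0, 3.0)],
]
bounds = [(0, 2), (2, 4)]
x = (0.9, 1.2, 2.8, 3.1)

indices = svq_encode(x, codebooks, bounds)
table_row = full_index(indices, [len(cb) for cb in codebooks])
print(indices, table_row)
```

Under these assumptions, the combined index could serve as the row of a discrete output distribution table, so that obtaining a probability becomes a single table lookup once the per-subvector codewords are known.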
