Using Auxiliary Sources of Knowledge for Automatic Speech Recognition

Standard hidden Markov model (HMM) based automatic speech recognition (ASR) systems usually use cepstral features as acoustic observations and phonemes as subword units. The speech signal exhibits a wide range of variability, for example due to environmental and speaker variation. This leads to different kinds of mismatch, such as mismatch between acoustic features and acoustic models, or mismatch between acoustic features and pronunciation models (given the acoustic models). The main focus of this work is on integrating auxiliary knowledge sources into standard ASR systems so as to make the acoustic models more robust to the variabilities in the speech signal. We refer to knowledge sources that can provide additional information about the sources of variability as auxiliary sources of knowledge. The auxiliary knowledge sources primarily investigated in the present work are auxiliary features and auxiliary subword units. Auxiliary features are secondary sources of information that lie outside the standard cepstral features. They can be estimated from the speech signal (e.g., pitch frequency, short-term energy and rate-of-speech) or obtained from additional measurements (e.g., articulator positions or visual information). They are correlated with the standard acoustic features, and can thus aid in estimating better acoustic models, which would be more robust to the variabilities present in the speech signal. The auxiliary features investigated here are pitch frequency, short-term energy and rate-of-speech. These features can be modelled in standard ASR either by concatenating them to the standard acoustic feature vectors or by using them to condition the emission distribution (as done in gender-based acoustic modelling). We have studied these two approaches within the framework of hybrid HMM/artificial neural network based ASR, dynamic Bayesian network based ASR and the TANDEM system on different ASR tasks.
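The two ways of modelling an auxiliary feature mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this work: the diagonal-covariance Gaussian emission model, the function names, the two-class pitch quantiser and its 150 Hz threshold are all assumptions made for illustration, and conditioning on a discrete class is a simplification in the spirit of gender-based acoustic modelling.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def concatenate(cepstral, aux):
    """Approach 1: append the auxiliary value(s) to the acoustic vector."""
    return list(cepstral) + list(aux)


def pitch_class(pitch_hz, threshold_hz=150.0):
    """Hypothetical quantiser; the threshold is illustrative only."""
    return "low" if pitch_hz < threshold_hz else "high"


def conditioned_log_likelihood(cepstral, aux_class, models):
    """Approach 2: the auxiliary class selects which emission model is used.

    `models` maps each auxiliary class to a (mean, variance) pair.
    """
    mean, var = models[aux_class]
    return log_gaussian_diag(cepstral, mean, var)


if __name__ == "__main__":
    frame = [0.0, 0.5]   # toy "cepstral" vector
    pitch = 120.0        # toy pitch estimate in Hz

    # Approach 1: one emission model over the extended vector.
    print(concatenate(frame, [pitch]))

    # Approach 2: one emission model per auxiliary class.
    toy_models = {
        "low": ([0.0, 0.5], [1.0, 1.0]),
        "high": ([1.0, 1.5], [1.0, 1.0]),
    }
    print(conditioned_log_likelihood(frame, pitch_class(pitch), toy_models))
```

In the first approach the auxiliary feature is treated like any other observation dimension, whereas in the second it only changes which distribution the standard features are scored against.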
Our studies show that modelling auxiliary features along with standard acoustic features can improve the performance of the ASR system in both clean and noisy conditions. We have also proposed an approach to evaluate the adequacy of the baseform pronunciation model of words. This approach allows us to compare different acoustic models as well as to extract pronunciation variants. Using this approach, we show that the matching and discriminative properties of a single baseform pronunciation can be improved by integrating auxiliary knowledge sources into standard ASR. Standard ASR systems usually use phonemes as the subword units in a Markov chain to model words. In the present thesis, we also study a system where word models are described by two parallel chains of subword units: one for phonemes and the other for graphemes (phoneme-grapheme based ASR). Models for both types of subword units are jointly learned using maximum likelihood training. During recognition, decoding is performed using either or both of the subword unit chains. In doing so, we use graphemes as auxiliary subword units. The main advantage of using graphemes is that the word models can be defined easily from the orthographic transcription, and are thus relatively noise free compared to word models based on phoneme units. At the same time, there are drawbacks to using graphemes as subword units, since the correspondence between graphemes and phonemes is weak in languages such as English. Experimental studies conducted for American English on different ASR tasks have shown that the proposed phoneme-grapheme based ASR system can perform better than the standard ASR system that uses only phonemes as its subword units. Furthermore, while modelling context-dependent graphemes (similar to context-dependent phonemes), we observed that context-dependent graphemes behave like phonemes.
ASR studies conducted on different tasks showed that modelling context-dependent graphemes only (without any phonetic information) can yield performance competitive with the state-of-the-art context-dependent phoneme-based ASR system.
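The point that grapheme word models follow directly from the orthographic transcription, while phoneme word models need a pronunciation lexicon, can be made concrete with a small sketch. Everything here is illustrative rather than taken from this work: the function names, the toy one-word lexicon, the "sil" boundary symbol and the left-centre+right label format (a common triphone-style convention) are assumptions.

```python
def grapheme_units(word):
    """Derive grapheme subword units directly from the spelling."""
    return [ch for ch in word.lower() if ch.isalpha()]


def context_dependent(units, boundary="sil"):
    """Expand units into left-centre+right context-dependent labels."""
    padded = [boundary] + list(units) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def word_model(word, phoneme_lexicon):
    """Describe a word by two parallel chains of subword units.

    The phoneme chain is None when the word is missing from the lexicon;
    the grapheme chain is always available.
    """
    return {
        "graphemes": grapheme_units(word),
        "phonemes": phoneme_lexicon.get(word.lower()),
    }


if __name__ == "__main__":
    # Toy lexicon with a single hand-written entry, for illustration only.
    toy_lexicon = {"cat": ["k", "ae", "t"]}

    print(word_model("cat", toy_lexicon))
    print(word_model("dog", toy_lexicon))
    print(context_dependent(grapheme_units("cat")))
```

The word missing from the toy lexicon still receives a grapheme chain, which is the practical advantage noted above; the weak grapheme-to-phoneme correspondence in English is what the context-dependent expansion is meant to compensate for.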
