Speech recognition with auxiliary information

State-of-the-art automatic speech recognition (ASR) systems are usually based on hidden Markov models (HMMs) that emit cepstral-based features, which are assumed to be piecewise stationary. Besides not being particularly robust to noise, these features are known to be very sensitive to "auxiliary" information such as pitch, energy, and rate-of-speech (ROS). Attempts so far to include such auxiliary information in state-of-the-art ASR systems have often simply appended the auxiliary features to the standard acoustic feature vectors. In the present paper, we investigate different approaches to incorporating this auxiliary information using dynamic Bayesian networks (DBNs) or hybrid HMM/ANN systems (HMMs combined with artificial neural networks). These approaches are motivated by the fact that the auxiliary information is not necessarily (directly) emitted by the HMM states but, rather, carries higher-level information (e.g., speaker characteristics) that is correlated with the standard features. As is implicitly done for gender modeling elsewhere, the auxiliary information then appears as a conditioning variable in the emission distributions, and it can be hidden (except in the case of some HMM/ANN systems) when its estimates become too noisy. Based on recognition experiments carried out on the OGI Numbers database (free-format numbers spoken over the telephone), we show that auxiliary information that conditions the distribution of the standard features can, in certain conditions, provide more robust recognition than auxiliary information that is appended to the standard features; this is most evident when energy is used as the auxiliary variable in noisy speech.
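The distinction between appending and conditioning can be illustrated with a toy, one-dimensional sketch. It is not the paper's implementation: the function names, the dictionary layout of `state`, the two-level quantization of the auxiliary variable ("low"/"high"), and all numeric values are assumptions made only for illustration. The sketch contrasts three emission scores for a single HMM state q: the appended form p(x, a | q), the conditioned form p(x | a, q), and the hidden form p(x | q) = sum over a of p(a) p(x | a, q).

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def loglik_appended(x, a_value, state):
    """log p(x, a | q): the auxiliary value is one more emitted dimension
    (diagonal covariance assumed, so the two terms simply add)."""
    return (gauss_logpdf(x, state["mean_x"], state["var_x"])
            + gauss_logpdf(a_value, state["mean_a"], state["var_a"]))

def loglik_conditioned(x, a_level, state):
    """log p(x | a, q): the (quantized) auxiliary level selects the Gaussian
    used for the standard feature; a itself is not emitted."""
    mean, var = state["cond"][a_level]
    return gauss_logpdf(x, mean, var)

def loglik_hidden(x, state, prior_a):
    """log p(x | q) = log sum_a p(a) p(x | a, q): the auxiliary variable is
    marginalized out, e.g. when its estimate is too noisy to trust."""
    terms = [math.log(p) + loglik_conditioned(x, level, state)
             for level, p in prior_a.items()]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Hypothetical parameters for one state (illustrative values only).
state = {
    "mean_x": 0.0, "var_x": 1.0,      # appended model: feature dimension
    "mean_a": 5.0, "var_a": 4.0,      # appended model: auxiliary dimension
    "cond": {"low": (-0.5, 1.0),      # conditioned model: (mean, var) per level
             "high": (0.5, 1.0)},
}
prior_a = {"low": 0.5, "high": 0.5}

x = 0.3
score_appended = loglik_appended(x, 6.0, state)
score_conditioned = loglik_conditioned(x, "high", state)
score_hidden = loglik_hidden(x, state, prior_a)
```

In the appended form, a noisy auxiliary value directly changes the state score through its own Gaussian term; in the hidden form, the observed auxiliary value does not enter the score at all, which is one way to read the robustness argument made in the abstract.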
