Exemplar-Based Processing for Speech Recognition: An Overview

Solving real-world classification and recognition problems requires a principled way of modeling both the physical phenomena that generate the observed data and the uncertainty in that data. This uncertainty arises because many aspects of data generation depend on variables that cannot be measured directly, or are too complex to model and are therefore treated as random fluctuations. In speech production, for example, uncertainty can come from vocal tract differences among speakers or from corruption by noise. The goal of modeling is to generalize from the set of observed data so that accurate inference (classification, decision, recognition) can be made about data yet to be observed, which we refer to as unseen data.
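The following minimal sketch makes the idea of generalizing from observed data to unseen data concrete. It follows the exemplar-based spirit of this overview but is not a method taken from it. The sketch assumes synthetic two-dimensional feature vectors for two hypothetical classes, with Gaussian jitter standing in for variability that is not directly measured. The observed samples are stored as exemplars, and each unseen vector is labeled by its nearest exemplar (a 1-nearest-neighbour rule). The function names, class centers, and data are illustrative only.

```python
import math
import random


def make_data(n_per_class, seed=0):
    """Hypothetical 2-D 'feature vectors' for two classes.

    The Gaussian jitter stands in for variation that is not directly
    measured (e.g. speaker differences or noise)."""
    rng = random.Random(seed)
    centers = {"a": (0.0, 0.0), "b": (3.0, 3.0)}  # assumed class centers
    data = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            point = (cx + rng.gauss(0.0, 1.0), cy + rng.gauss(0.0, 1.0))
            data.append((point, label))
    return data


def nearest_exemplar_label(exemplars, x):
    """Label an unseen vector x with the label of its closest stored exemplar."""
    _, best_label = min(exemplars, key=lambda e: math.dist(e[0], x))
    return best_label


def accuracy(exemplars, test):
    """Fraction of unseen samples whose nearest-exemplar label is correct."""
    correct = sum(nearest_exemplar_label(exemplars, x) == y for x, y in test)
    return correct / len(test)


train = make_data(50, seed=0)  # observed data, kept as exemplars
test = make_data(50, seed=1)   # unseen data drawn from the same process
print(f"accuracy on unseen data: {accuracy(train, test):.2f}")
```

Because the unseen set is drawn from the same process with a different seed, the printed accuracy measures how well the stored exemplars generalize. Moving the assumed class centers closer together increases the overlap between classes and typically lowers it.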