Front end analysis of speech recognition: a review

Automatic speech recognition (ASR) has made great strides with the development of digital signal processing hardware and software. Despite these advances, machines cannot match the performance of their human counterparts in accuracy and speed, especially in speaker-independent speech recognition. A significant portion of current speech recognition research therefore focuses on the speaker-independent recognition problem. Before recognition, the speech signal must be processed to obtain its feature vectors, so front end analysis plays an important role. This is due to its wide range of applications and the limitations of available speech recognition techniques. In this report we briefly discuss the different aspects of front end analysis for speech recognition, including sound characteristics, feature extraction techniques, and spectral representations of the speech signal. We also discuss the advantages and disadvantages of each feature extraction technique, along with the suitability of each method for particular applications.
