Speech Enhancement Paradigm

Speech enhancement techniques aim to improve the quality and intelligibility of speech that has been degraded by noise. The specific goal depends on the application: increasing overall speech quality or intelligibility, reducing listener fatigue, or improving the overall performance of an automatic speech recognition (ASR) system embedded in a voice communication system. This chapter begins with background on noise and its estimation, then reviews some well-known speech enhancement methods. It also gives an overview of the assessment methods used to evaluate speech enhancement algorithms in terms of quality and intelligibility.
