Speech signal enhancement by information combining

Mobile phones and tablets are omnipresent in everyday life. Today, audiovisual communication takes place at different locations and in a large variety of acoustic environments. As a consequence, both the intelligibility and the quality of speech may be significantly degraded by ambient background noise. To improve speech intelligibility and to ensure convenient communication with high audio quality, speech enhancement techniques are required. This thesis addresses all critical components contributing to the enhancement of the up-link signal:

• signal capturing at the acoustic front-end with a new near field beamformer,
• a new codebook-based speech and noise estimation procedure that generates and exploits reliability information, and
• the actual noise reduction, which exploits spectral dependencies of human speech.

For the acoustic front-end of the digital processing chain, a novel concept for the filter optimization of a near field beamformer is introduced. The optimization scheme makes it possible to closely approximate a predefined reception characteristic, which can be freely chosen according to the application. The output of the beamformer provides a pre-enhanced signal with improved SNR for subsequent single-microphone speech enhancement.

Single-microphone noise reduction usually relies on statistical properties of speech and noise. In general, the noise is assumed to be stationary or only slightly time-varying, an assumption that is often not fulfilled in practice. Due to imprecise noise estimation, single-microphone systems are prone to unpleasant artifacts known as musical tones. In this context, different Information Combining methods that merge various estimates are presented. They specifically address the problem of non-stationary noise signals and lead to significantly improved estimation accuracy. On the one hand, the proposed Information Combining exploits the spectral dependencies of human speech. On the other hand, it merges the best of several speech and noise estimates depending on their reliability. The necessary estimates are provided by a new statistical noise estimator as well as a codebook-driven speech and noise estimation algorithm. The achieved estimation quality opens up the possibility of closing the gap between the conflicting goals of high noise attenuation, low speech distortion, and the prevention of undesired musical tone artifacts. Finally, the practical aspects of the proposed enhancement systems are considered and discussed on the basis of two implemented real-time demonstrators.
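
The idea of merging several estimates according to their reliability can be illustrated with a small sketch. The abstract does not state the combining rule used in the thesis, so the example below swaps in plain inverse-variance weighting per frequency bin as a stand-in. The function name `combine_estimates`, the use of an error variance as the reliability measure, and the sample values are all assumptions made for illustration only.

```python
def combine_estimates(estimates, variances):
    """Merge K estimates per frequency bin with inverse-variance weights.

    Illustrative assumption: the reliability of each estimate is expressed
    as an error variance per bin (lower variance = more reliable). This is
    NOT taken from the thesis; it is a generic combining rule.

    estimates: list of K lists, each holding one value per frequency bin
    variances: list of K lists with the matching (positive) error variances
    """
    num_bins = len(estimates[0])
    combined = []
    for k in range(num_bins):
        # A more reliable estimate (smaller variance) gets a larger weight.
        weights = [1.0 / var[k] for var in variances]
        weighted_sum = sum(w * est[k] for w, est in zip(weights, estimates))
        combined.append(weighted_sum / sum(weights))
    return combined


# Two hypothetical noise power estimates over two frequency bins.
estimate_a = [1.0, 4.0]
estimate_b = [3.0, 2.0]
variance_a = [1.0, 4.0]  # estimator A is unreliable in the second bin
variance_b = [1.0, 1.0]

merged = combine_estimates([estimate_a, estimate_b], [variance_a, variance_b])
```

In the first bin both estimates are equally reliable, so the result is their plain average. In the second bin the result is pulled toward `estimate_b`, which has the smaller error variance there.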


[65]  Lou Boves,et al.  Channel normalization techniques for automatic speech recognition over the telephone , 1998, Speech Commun..

[66]  Simon Huettinger,et al.  Information Processing and Combining in Channel Coding , 2004 .

[67]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[68]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[69]  Peter Vary,et al.  Recursive Closed-Form Optimization of Spectral Audio Power Allocation for Near End Listening Enhancement , 2010, Sprachkommunikation.

[70]  Saeed Vaseghi,et al.  Advanced Signal Processing and Digital Noise Reduction , 1996 .

[71]  Arthur Schuster,et al.  On the investigation of hidden periodicities with application to a supposed 26 day period of meteorological phenomena , 1898 .

[72]  Pablo César,et al.  Enabling Composition-Based Video-Conferencing for the Home , 2011, IEEE Transactions on Multimedia.

[73]  Peter Vary,et al.  Dual channel reduction of rapidly varying harmonic and random noise using a spot microphone , 2011 .

[74]  Simon J. Godsill,et al.  Detection and suppression of keyboard transient noise in audio streams with auxiliary keybed microphone , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[75]  Rainer Martin,et al.  Spectral Subtraction Based on Minimum Statistics , 2001 .

[76]  Thomas Esch,et al.  Wideband noise suppression supported by artificial bandwidth extension techniques , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[77]  Biing-Hwang Juang,et al.  Line spectrum pair (LSP) and speech data compression , 1984, ICASSP.

[78]  M.G. Bellanger,et al.  Digital processing of speech signals , 1980, Proceedings of the IEEE.

[79]  Masoud Salehi,et al.  Communication Systems Engineering , 1994 .

[80]  D. V. Anderson,et al.  FFT-Based Block Processing in Speech Enhancement: Potential Artifacts and Solutions , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[81]  Peter Vary,et al.  Speech Enhancement by MAP Spectral Amplitude Estimation Using a Super-Gaussian Speech Model , 2005, EURASIP J. Adv. Signal Process..

[82]  Martin Westphal,et al.  The use of cepstral means in conversational speech recognition , 1997, EUROSPEECH.

[83]  Jesper Jensen,et al.  Minimum Mean-Square Error Estimation of Discrete Fourier Coefficients With Generalized Gamma Priors , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[84]  Johannes B. Huber,et al.  Bounds on information combining , 2005, IEEE Transactions on Information Theory.

[85]  Nobuhiko Kitawaki,et al.  Pure Delay Effects on Speech Quality in Telecommunications , 1991, IEEE J. Sel. Areas Commun..

[86]  Jacob Benesty,et al.  Noise Reduction in Speech Processing , 2009 .

[87]  Rainer Martin,et al.  Improved A Posteriori Speech Presence Probability Estimation Based on a Likelihood Ratio With Fixed Priors , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[88]  D. Rajan Probability, Random Variables, and Stochastic Processes , 2017 .

[89]  Jae S. Lim,et al.  The unimportance of phase in speech enhancement , 1982 .

[90]  Joachim M. Buhmann,et al.  Speech Enhancement Using Generative Dictionary Learning , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[91]  David Malah,et al.  Speech enhancement using a minimum mean-square error log-spectral amplitude estimator , 1984, IEEE Trans. Acoust. Speech Signal Process..

[92]  S. Nordholm,et al.  Non-uniform Optimal Subband Beamforming: An Evaluation on Real Acoustic Measurements , 2008, 2008 Congress on Image and Signal Processing.

[93]  Tomohiro Nakatani,et al.  Noisy speech enhancement based on prior knowledge about spectral envelope and harmonic structure , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[94]  Norbert Wiener,et al.  Extrapolation, Interpolation, and Smoothing of Stationary Time Series , 1964 .

[95]  Thomas Esch,et al.  Noise Reduction for Wideband Speech Exploiting Spectral Dependencies Based on Conditional Estimation , 2010, Sprachkommunikation.

[96]  Alexander H. Waibel,et al.  Knowing who to listen to in speech recognition: visually guided beamforming , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[97]  Peter Vary,et al.  Intelligibility Enhancement For Hands-Free Mobile Communication , 2015 .

[98]  F. Itakura Line spectrum representation of linear predictor coefficients of speech signals , 1975 .

[99]  Jesper Jensen,et al.  MMSE based noise PSD tracking with low complexity , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[100]  Christopher Bulla,et al.  Performance Evaluation of Object Representations in Mean Shift Tracking , 2013, MMEDIA 2013.

[101]  Christophe Beaugeant,et al.  Robust dual-channel noise power spectral density estimation , 2011, 2011 19th European Signal Processing Conference.

[102]  Changchun Bao,et al.  Speech enhancement based on AR model parameters estimation , 2016, Speech Commun..

[103]  B.D. Van Veen,et al.  Beamforming: a versatile approach to spatial filtering , 1988, IEEE ASSP Magazine.

[104]  Rainer Martin,et al.  An evaluation of noise power spectral density estimation algorithms in adverse acoustic environments , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[105]  Marc Moonen,et al.  Design of far-field and near-field broadband beamformers using eigenfilters , 2003, Signal Process..

[106]  Israel Cohen,et al.  Multichannel Eigenspace Beamforming in a Reverberant Noisy Environment With Multiple Interfering Speech Signals , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[107]  J. Tukey,et al.  An algorithm for the machine calculation of complex Fourier series , 1965 .

[108]  Kuldip K. Paliwal,et al.  Speech Coding and Synthesis , 1995 .

[109]  Peter Vary,et al.  Numerical near field optimization of a non-uniform sub-band filter-and-sum beamformer , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[110]  Sascha Spors,et al.  Joint audio-video object localization and tracking , 2001 .

[111]  Hing-Cheung So,et al.  Speech enhancement in car noise envoronment based on an analysis-synthesis approach using harmonic noise model , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[112]  Changchun Bao,et al.  An improved dictionary learning method for speech enhancement , 2015, 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).

[113]  Schuyler Quackenbush,et al.  Objective measures of speech quality , 1995 .

[114]  Mike Brookes,et al.  PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[115]  Kung Yao,et al.  Broadband array processing using subband techniques , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[116]  Afsaneh Asaei,et al.  An integrated framework for multi-channel multi-source localization and voice activity detection , 2011, 2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays.

[117]  Peter Vary,et al.  Numerical Near Field Optimization of Weighted Delay-and-Sum Microphone Arrays , 2012, IWAENC.

[118]  Heinrich W. Lollmann,et al.  Allpass based analysis synthesis filter banks : design and application , 2011 .

[119]  Peter Vary,et al.  Noise suppression by spectral magnitude estimation —mechanism and theoretical limits— , 1985 .

[120]  Jorge Nocedal,et al.  A trust region method based on interior point techniques for nonlinear programming , 2000, Math. Program..

[121]  Peter Vary,et al.  Multichannel audio database in various acoustic environments , 2014, 2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC).

[122]  Boaz Rafaely,et al.  Near-Field Spherical Microphone Array Processing With Radial Filtering , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[123]  John Charles Cox,et al.  The minimum detectable delay of speech and music , 1984, ICASSP.

[124]  J. Litva,et al.  Radar Array Processing , 1993 .

[125]  A. Kondoz,et al.  Analysis and improvement of a statistical model-based voice activity detector , 2001, IEEE Signal Processing Letters.

[126]  Wei-Ping Zhu,et al.  Robust pitch estimation at very low SNR exploiting time and frequency domain cues , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[127]  Peter Vary,et al.  A Modified Minimum Statistics Algorithm for Reducing Time Varying Harmonic Noise , 2010, Sprachkommunikation.

[128]  Sven Nordholm,et al.  Design of oversampled uniform DFT filter banks with delay specification using quadratic optimization , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[129]  Christopher Bulla,et al.  High Quality Video Conferencing: Region of Interest Encoding and Joint Video/Audio Analysis , 2013 .

[130]  L. J. Griffiths,et al.  An alternative approach to linearly constrained adaptive beamforming , 1982 .

[131]  Jerry D. Gibson,et al.  COMPARISON OF DISTANCE MEASURES IN DISCRETE SPECTRAL MODELING , 2000 .

[132]  Richard C. Hendriks,et al.  Noise power estimation based on the probability of speech presence , 2011, 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).