Supervised Speech Separation Based on Deep Learning: An Overview

Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is devoted to separation algorithms, where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
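To make the notion of a masking-based training target concrete, the sketch below computes two targets widely used in this literature, the ideal binary mask (IBM) and the ideal ratio mask (IRM), from premixed clean-speech and noise magnitude spectrograms. It is a minimal illustration under standard definitions; the function names, the epsilon guard, and the default local criterion and compression exponent are our assumptions, not values taken from the paper.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM from clean-speech and noise magnitude spectrograms (time x frequency).

    beta = 0.5 applies the commonly used square-root compression.
    """
    speech_energy = speech_mag ** 2
    noise_energy = noise_mag ** 2
    # Small epsilon avoids division by zero in silent time-frequency units.
    return (speech_energy / (speech_energy + noise_energy + 1e-12)) ** beta

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the local criterion (in dB), else 0."""
    snr_db = 20.0 * np.log10((speech_mag + 1e-12) / (noise_mag + 1e-12))
    return (snr_db > lc_db).astype(np.float32)
```

In the supervised setting, a learning machine (typically a deep neural network) is trained to predict such a mask from acoustic features of the noisy mixture; at test time the estimated mask is applied to the mixture spectrogram before resynthesizing the time-domain signal.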
