Past review, current progress, and challenges ahead on the cocktail party problem

The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems that must be solved before automatic speech recognition (ASR) systems can be widely applied. In this overview paper, we review the techniques proposed over the last two decades to attack this problem. We focus our discussion on speech separation, given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF), and generative models; the conventional multi-channel techniques such as beamforming and multi-channel blind source separation; and the newly developed deep learning-based techniques such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting the information in the microphone array, the acoustic training set, and the language itself, using more powerful models and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.
