Modeling Attention and Memory for Auditory Selection in a Cocktail Party Environment

Developing a computational auditory model that solves the cocktail party problem has long bedeviled scientists, especially for single-microphone recordings. Although recent deep-learning frameworks have made significant progress on multi-talker mixed-speech separation, most existing methods aim to separate all speech channels rather than selectively attend to the target speech while ignoring other sounds; they may therefore fail to offer a satisfactory solution in complex auditory scenes, where the number of input sounds is usually uncertain and even dynamic. In this work, we draw on ideas from auditory selective attention in behavioral and cognitive neuroscience and from recent advances in memory-augmented neural networks. Specifically, we propose a unified Auditory Selection framework with Attention and Memory (dubbed ASAM). ASAM first accumulates prior knowledge (i.e., the acoustic features of a specific speaker) into a life-long memory during the training phase, while a speech perceptor is trained to extract temporal acoustic features and to update the memory online whenever salient speech is presented. The learned memory is then used to interact with the mixture input, attending to and filtering the target frequencies out of the mixture stream. Finally, the network is trained to minimize the reconstruction error of the attended speech. We evaluate the proposed approach on the WSJ0 and THCHS-30 datasets, and the experimental results demonstrate that it successfully performs two auditory selection tasks: top-down, task-specific attention (e.g., following a conversation with a friend) and bottom-up, stimulus-driven attention (e.g., being attracted by salient speech). Compared with deep-clustering-based methods, our method shows competitive advantages, especially in real noise environments (e.g., a street junction). Our code is available at https://github.com/jacoxu/ASAM.
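
To make the pipeline concrete, below is a minimal PyTorch sketch of the attend-and-filter step described above: a perceptor embeds the mixture, a memory slot keyed by speaker identity supplies the attention query, and the resulting soft time-frequency mask is applied to the mixture before computing the reconstruction loss. The names (SpeechPerceptor, ASAMSketch, update_memory, mask_proj), the dimensions, and the moving-average memory update rule are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechPerceptor(nn.Module):
    """BiLSTM mapping a spectrogram (batch, time, freq) to frame embeddings."""
    def __init__(self, n_freq=129, d_embed=64):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, d_embed // 2,
                           batch_first=True, bidirectional=True)

    def forward(self, spec):
        h, _ = self.rnn(spec)          # (batch, time, d_embed)
        return h

class ASAMSketch(nn.Module):
    def __init__(self, n_speakers=100, n_freq=129, d_embed=64):
        super().__init__()
        self.perceptor = SpeechPerceptor(n_freq, d_embed)
        self.mask_proj = nn.Linear(d_embed, n_freq)
        # Life-long memory: one embedding slot per known speaker (assumed layout).
        self.register_buffer("memory", torch.zeros(n_speakers, d_embed))

    def update_memory(self, speaker_id, salient_spec, rate=0.5):
        """Online update of a speaker's slot from salient (clean) speech."""
        with torch.no_grad():
            emb = self.perceptor(salient_spec).mean(dim=(0, 1))   # (d_embed,)
            slot = self.memory[speaker_id]
            self.memory[speaker_id] = F.normalize(
                (1.0 - rate) * slot + rate * emb, dim=0)

    def forward(self, mixture_spec, speaker_id):
        """Attend to the memorized speaker and mask the mixture."""
        h = self.perceptor(mixture_spec)                 # (batch, time, d_embed)
        key = self.memory[speaker_id]                    # (d_embed,)
        # Frame-wise attention: how strongly each frame matches the memory key.
        attn = torch.sigmoid(h @ key).unsqueeze(-1)      # (batch, time, 1)
        # Project the attended embeddings to a soft time-frequency mask.
        mask = torch.sigmoid(self.mask_proj(attn * h))   # (batch, time, freq)
        return mask * mixture_spec                       # attended spectrogram

# Training objective: reconstruction error of the attended speech.
model = ASAMSketch()
mixture = torch.randn(2, 50, 129)   # toy batch: 2 mixtures, 50 frames, 129 bins
target = torch.randn(2, 50, 129)    # target speaker's clean spectrogram
estimate = model(mixture, speaker_id=3)
loss = F.mse_loss(estimate, target)
loss.backward()
```

In this reading, calling `model.update_memory(speaker_id, clean_utterance)` whenever salient speech arrives plays the role of the bottom-up, stimulus-driven pathway, while supplying a `speaker_id` at inference time corresponds to the top-down, task-specific selection.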
