Does Phase Matter For Monaural Source Separation?

The "cocktail party" problem of fully separating multiple sources from a single channel audio waveform remains unsolved. Current biological understanding of neural encoding suggests that phase information is preserved and utilized at every stage of the auditory pathway. However, current computational approaches primarily discard phase information in order to mask amplitude spectrograms of sound. In this paper, we seek to address whether preserving phase information in spectral representations of sound provides better results in monaural separation of vocals from a musical track by using a neurally plausible sparse generative model. Our results demonstrate that preserving phase information reduces artifacts in the separated tracks, as quantified by the signal to artifact ratio (GSAR). Furthermore, our proposed method achieves state-of-the-art performance for source separation, as quantified by a mean signal to interference ratio (GSIR) of 19.46.
