Principles and Typical Computational Limitations of Sparse Speaker Separation Based on Deterministic Speech Features

The separation of mixed auditory signals into their sources is a prominent challenge in both neuroscience and engineering. We reveal the principles underlying a deterministic, neural-network-like solution to this problem. This approach is orthogonal to independent and principal component analysis (ICA/PCA), which treat the signal constituents as independent realizations of random processes. We demonstrate by example that, in the absence of salient frequency modulations, decomposing speech signals into local cosine packets allows for sparse, noise-robust speaker separation. As our main result, we present analytical limitations inherent in the approach and propose strategies for dealing with them. Our results open new perspectives on efficient noise cleaning and auditory signal separation, and suggest how the brain might achieve these tasks.
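The abstract does not state which algorithm performs the decomposition into local cosine packets. The following is a minimal sketch, assuming a greedy matching-pursuit decomposition (a technique substituted here purely for illustration) over a dictionary of sine-windowed cosine atoms that stand in for local cosine packets. It shows how a signal can be summarized by a few atoms, each localized in time and frequency. The names `cosine_atom`, `build_dictionary`, and `matching_pursuit`, the window shape, and the half-window hop are hypothetical choices, not the authors' implementation.

```python
import math

def cosine_atom(length, start, width, k):
    """Hypothetical local cosine atom: a sine-windowed cosine of frequency
    index k supported on [start, start + width), normalized to unit energy."""
    atom = [0.0] * length
    for n in range(width):
        window = math.sin(math.pi * (n + 0.5) / width)  # smooth bell, never zero
        atom[start + n] = window * math.cos(math.pi * (k + 0.5) * (n + 0.5) / width)
    norm = math.sqrt(sum(a * a for a in atom))
    return [a / norm for a in atom]

def build_dictionary(length, width):
    """All atoms whose support fits in the signal, hopping by half a window.
    Returns the atoms and their (start, k) parameters in the same order."""
    atoms, params = [], []
    for start in range(0, length - width + 1, width // 2):
        for k in range(width):
            atoms.append(cosine_atom(length, start, width, k))
            params.append((start, k))
    return atoms, params

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse decomposition: repeatedly pick the atom with the largest
    absolute inner product with the residual and subtract its projection."""
    residual = list(signal)
    picks = []
    for _ in range(n_atoms):
        best_idx, best_c = 0, 0.0
        for idx, atom in enumerate(dictionary):
            c = sum(r * a for r, a in zip(residual, atom))
            if abs(c) > abs(best_c):
                best_idx, best_c = idx, c
        if best_c == 0.0:  # residual is orthogonal to every atom
            break
        residual = [r - best_c * a for r, a in zip(residual, dictionary[best_idx])]
        picks.append((best_idx, best_c))
    return picks, residual

if __name__ == "__main__":
    atoms, params = build_dictionary(64, 16)
    # A toy signal built from two non-overlapping atoms.
    signal = [3.0 * x - 2.0 * y for x, y in zip(atoms[3], atoms[69])]
    picks, residual = matching_pursuit(signal, atoms, n_atoms=2)
    for idx, coeff in picks:
        print(params[idx], round(coeff, 6))
```

Each pick is an (index, coefficient) pair, and `params[index]` gives the atom's start sample and frequency index, so a handful of picks summarizes where, and at what frequency, the signal's energy lies.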
