MODEL-BASED SPARSE COMPONENT ANALYSIS FOR MULTIPARTY DISTANT SPEECH RECOGNITION

This research aims to improve the performance of Distant Speech Recognition (DSR) systems by tackling reverberation and the recognition of overlapping speech. Perceptual modeling indicates that sparse representations exist in the auditory cortex. The present project thus builds on the hypothesis that incorporating this information into DSR front-end processing could improve speech recognition performance in realistic conditions, including overlap and reverberation. More specifically, the goal of my PhD thesis is to exploit blind (source) separation of the speech components in a sparse space, also referred to as sparse component analysis (SCA), for multi-party, multi-channel speech recognition.
