Online blind speech separation using multiple acoustic speaker tracking and time-frequency masking

Separating the speech signals of multiple simultaneous talkers in a reverberant enclosure is known as the cocktail party problem. Real-time applications require online solutions that separate the signals as they are observed, rather than offline after the observation is complete. Talkers may also move, which the separation system should take into account. This work proposes an online method for speaker detection, speaker direction tracking, and speech separation. The separation is based on multiple acoustic source tracking (MAST) using Bayesian filtering and time-frequency masking. Measurements from three room environments with varying amounts of reverberation, made with two different microphone array designs, are used to evaluate the method's ability to separate up to four simultaneously active speakers. Separation of moving talkers is also considered. Results are compared to two reference methods: ideal binary masking (IBM) and oracle tracking (O-T). Simulations are used to evaluate the effect of the number of microphones and their spacing.
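
As an illustration of the IBM reference mentioned above, the following is a minimal sketch of ideal binary masking in its commonly used form: a time-frequency bin is assigned to a speaker when that speaker's power exceeds the summed power of all other speakers by a local criterion. The function names, the `lc_db` threshold parameter, and the toy power values are assumptions made for this example; the exact IBM definition used in the evaluation is not given in this abstract.

```python
from typing import List

# A spectrogram is a list of frames, each a list of per-frequency-bin power values.
Spectrogram = List[List[float]]


def ideal_binary_masks(source_powers: List[Spectrogram], lc_db: float = 0.0) -> List[Spectrogram]:
    """Return one binary mask per source (hypothetical helper, not the paper's code).

    A bin is set to 1.0 for a source when its power exceeds the summed power
    of the other sources by more than lc_db decibels, else 0.0.
    """
    ratio = 10.0 ** (lc_db / 10.0)  # convert the dB threshold to a power ratio
    n_frames = len(source_powers[0])
    n_bins = len(source_powers[0][0])
    masks = []
    for s, target in enumerate(source_powers):
        mask = []
        for t in range(n_frames):
            row = []
            for f in range(n_bins):
                # Interference is the power of every source except the target.
                interference = sum(
                    other[t][f] for k, other in enumerate(source_powers) if k != s
                )
                row.append(1.0 if target[t][f] > ratio * interference else 0.0)
            mask.append(row)
        masks.append(mask)
    return masks


def apply_mask(mixture: Spectrogram, mask: Spectrogram) -> Spectrogram:
    """Multiply the mixture spectrogram by a mask, bin by bin."""
    return [[m * x for m, x in zip(mrow, xrow)] for mrow, xrow in zip(mask, mixture)]


# Toy example: two speakers, two frames, two frequency bins (made-up values).
speaker_a = [[4.0, 0.1], [0.2, 3.0]]
speaker_b = [[1.0, 2.0], [0.2, 0.5]]
mixture = [[5.0, 2.1], [0.4, 3.5]]  # assumes the source powers add

masks = ideal_binary_masks([speaker_a, speaker_b])
separated_a = apply_mask(mixture, masks[0])
```

Because the mask requires the true source signals, it is available only as an oracle reference and not to an online blind method. Note that a bin where the two speakers have equal power is assigned to neither speaker under the strict inequality used here.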
