Group Delay Based Methods for Speaker Segregation and its Application in Multimedia Information Retrieval

A novel method for single-channel speaker segregation using the group delay cross-correlation function is proposed in this paper. The group delay function, defined as the negative derivative of the Fourier phase spectrum, yields robust spectral estimates. Group delay spectral estimates are therefore first computed over frequency sub-bands after passing the speech signal through a bank of filters, whose spacing is derived from a multi-pitch algorithm that estimates the pitches of the competing speakers. An affinity matrix is then computed from the group delay spectral estimates of each frequency sub-band; this matrix represents the correlations among the sub-bands of the mixed broadband speech signal. Correlated harmonics in the mixed signal are then grouped using a new iterative graph-cut method, and signals are reconstructed from the resulting harmonic groups, each of which represents an individual speaker. Spectrographic masks are applied to the reconstructed signals to refine their perceptual quality. The quality of the separated speech is evaluated using several objective and subjective criteria, and experiments on multi-speaker automatic speech recognition are conducted using mixed speech data from the GRID corpus. A cell-phone-based multimedia information retrieval system (MIRS) for multi-source meeting environments is also developed.
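The abstract's core quantity, the group delay function, can be computed without explicit phase unwrapping via the standard DFT identity τ(ω) = (X_R Y_R + X_I Y_I) / |X(ω)|², where Y is the DFT of n·x[n]. A minimal NumPy sketch (the function name and the spectral-null guard are illustrative, not from the paper):

```python
import numpy as np

def group_delay(x, n_fft=512):
    """Group delay spectrum of a signal frame x[n].

    Uses tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2, where Y is the
    DFT of n*x[n]; this sidesteps differentiating an unwrapped phase.
    """
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    denom = np.abs(X) ** 2
    denom = np.maximum(denom, 1e-12)  # guard against spectral nulls
    return (X.real * Y.real + X.imag * Y.imag) / denom
```

As a sanity check, a pure delay of d samples has a flat group delay of d at every frequency, which this formulation reproduces exactly.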
