Speaker identification with distant microphone speech

The field of speaker identification has recently seen significant advancement, but improvements have tended to be benchmarked on near-field speech, ignoring the more realistic setting of far-field-instrumented speakers. In this work we present several findings on far-field speech from the MIXER5 Corpus, in the areas of feature extraction, speaker modeling, and multichannel score combination. First, we observe that minimum-variance distortionless response (MVDR) features outperform Mel-frequency cepstral coefficient (MFCC) features, and that fundamental frequency variation (FFV) features offer complimentary information to both MFCC and MVDR features. Second, we present evidence that factor analysis significantly improves system performance, compared to the more traditional GMM/UBM strategy. Third, we find that frame-based score competition significantly improves performance under mismatched conditions with multiple channels available.

[1]  Douglas A. Reynolds,et al.  A Tutorial on Text-Independent Speaker Verification , 2004, EURASIP J. Adv. Signal Process..

[2]  Mattias Heldner,et al.  An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Kuldip K. Paliwal,et al.  Automatic Speech and Speaker Recognition , 1996 .

[4]  Tanja Schultz,et al.  Speaker identification using warped MVDR cepstral features , 2009, INTERSPEECH.

[5]  Nicholas W. D. Evans,et al.  ALIZE/spkdet: a state-of-the-art open source software for speaker recognition , 2008, Odyssey.

[6]  Kuldip K. Paliwal,et al.  Automatic Speech and Speaker Recognition: Advanced Topics , 1999 .

[7]  Patrick Kenny,et al.  Factor analysis simplified [speaker verification applications] , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[8]  A. Murat Tekalp,et al.  Joint audio-video processing for biometric speaker identification , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[9]  Tanja Schultz,et al.  Far-Field Speaker Recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Christopher Cieri,et al.  Speaker Recognition: Building the Mixer 4 and 5 Corpora , 2008, LREC.

[11]  Douglas A. Reynolds,et al.  An overview of automatic speaker diarization systems , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[13]  Kornel Laskowski,et al.  Modeling instantaneous intonation for speaker identification using the fundamental frequency variation spectrum , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[14]  M. Wolfel,et al.  Minimum variance distortionless response spectral estimation , 2005, IEEE Signal Processing Magazine.

[15]  G.R. Doddington,et al.  Speaker recognition—Identifying people by their voices , 1985, Proceedings of the IEEE.

[16]  D. O'Shaughnessy,et al.  Speaker recognition , 1986, IEEE ASSP Magazine.