Robust speaker recognition

Automatic speaker recognition has become an increasingly important technology, required by many speech-aided applications. Its main challenge is the variability of the environments and channels through which speech is captured. Previous work has achieved good results on clean, high-quality speech with matched training and test acoustic conditions, for example high speaker identification and verification accuracy using clean wideband speech and Gaussian Mixture Models (GMM). However, under the mismatched conditions and noisy environments typical of real-world use, the performance of GMM-based systems degrades significantly and falls well short of a satisfactory level. Robustness has therefore become a crucial research issue in the speaker recognition field. In this thesis, our main focus is to improve the robustness of speaker recognition systems on far-field (distant) microphones. We investigate approaches to improving robustness from two directions. First, we investigate approaches for the traditional speaker recognition system, which is based on low-level spectral information. We introduce a new reverberation compensation approach which, together with feature warping in the feature processing stage, improves system performance significantly. We also propose four multiple-channel combination approaches, which use information from multiple far-field microphones to improve robustness under mismatched training and testing conditions. Second, we investigate the use of high-level speaker information to improve robustness. We propose new techniques that model speaker pronunciation idiosyncrasy along two dimensions: the cross-stream dimension and the time dimension. Such high-level information is expected to remain robust under different mismatched conditions. We also build systems that support robust speaker recognition. We implement a speaker segmentation and clustering system aimed at improving the robustness of speaker recognition, as well as automatic speech recognition performance, in multiple-speaker scenarios such as telephone conversations and meetings. Finally, we integrate speaker identification with face recognition to build a robust person identification system.
