Speaker Identification Method Using Earth Mover's Distance for CCC Speaker Recognition Evaluation 2006

In this paper, we present a non-parametric speaker identification method based on the Earth Mover's Distance (EMD), designed for text-independent speaker identification, and report its results in the CCC Speaker Recognition Evaluation 2006, organized by the Chinese Corpus Consortium (CCC) for the International Symposium on Chinese Spoken Language Processing (ISCSLP 2006). EMD-based speaker identification (EMD-IR) was originally designed for a distributed speaker identification system, in which feature vectors are compressed by vector quantization at a terminal and sent to a server that performs pattern matching. In this architecture, speaker models must be trained on quantized data, so we adopted a non-parametric speaker model together with EMD. In experiments on a Japanese speech corpus, EMD-IR was more robust to quantized data than the conventional GMM technique. Moreover, it achieved higher accuracy than GMM even when the data were not quantized. We therefore took up the challenge of the CCC Speaker Recognition Evaluation 2006 using EMD-IR. Since the identification tasks defined in the evaluation were open-set, we introduce a new speaker verification module. Evaluation results show that EMD-IR achieves a 99.3% Identification Correctness Rate in a closed-channel speaker identification task.
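To make the matching step concrete, the following is a minimal sketch of how EMD between two quantized feature sets could be computed and used for an open-set decision. It is not the authors' implementation. It assumes each utterance is summarized as a "signature", a list of (codeword vector, frame count) pairs produced by vector quantization, that the ground distance is Euclidean, and that open-set rejection is a plain distance threshold. The names `emd`, `identify`, and `threshold` are hypothetical, and the verification module used in the paper is not described in this abstract.

```python
import math
from collections import deque


def emd(sig_p, sig_q):
    """Earth Mover's Distance between two signatures.

    Each signature is a list of (vector, integer_weight) pairs, e.g. VQ
    codewords with frame counts (an assumed representation). Solved as a
    min-cost flow with successive shortest paths; fine for small inputs.
    """
    n, m = len(sig_p), len(sig_q)
    src, snk = n + m, n + m + 1
    graph = [[] for _ in range(n + m + 2)]  # node -> list of edge indices
    to, cap, cost = [], [], []

    def add_edge(u, v, c, w):
        # Forward edge at an even index, its reverse at the next odd index.
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)

    big = sum(w for _, w in sig_p)  # enough capacity for any middle edge
    for i, (_, w) in enumerate(sig_p):
        add_edge(src, i, w, 0.0)
    for j, (_, w) in enumerate(sig_q):
        add_edge(n + j, snk, w, 0.0)
    for i, (x, _) in enumerate(sig_p):
        for j, (y, _) in enumerate(sig_q):
            add_edge(i, n + j, big, math.dist(x, y))  # Euclidean ground distance

    total_flow, total_cost = 0, 0.0
    while True:
        # Bellman-Ford (queue-based) shortest path in the residual graph.
        dist = [math.inf] * len(graph)
        prev_edge = [-1] * len(graph)
        in_queue = [False] * len(graph)
        dist[src] = 0.0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for e in graph[u]:
                v = to[e]
                if cap[e] > 0 and dist[u] + cost[e] < dist[v] - 1e-12:
                    dist[v] = dist[u] + cost[e]
                    prev_edge[v] = e
                    if not in_queue[v]:
                        queue.append(v)
                        in_queue[v] = True
        if prev_edge[snk] == -1:
            break  # no more mass can be moved
        # Find the bottleneck along the path, then push that much flow.
        push, v = math.inf, snk
        while v != src:
            e = prev_edge[v]
            push = min(push, cap[e])
            v = to[e ^ 1]
        v = snk
        while v != src:
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        total_flow += push
        total_cost += push * dist[snk]
    if total_flow == 0:
        raise ValueError("both signatures need positive total weight")
    return total_cost / total_flow  # normalize by the amount of mass moved


def identify(test_sig, speaker_models, threshold):
    """Return the closest enrolled speaker, or None if even the best match
    is farther than `threshold` (a stand-in for open-set rejection)."""
    best = min(speaker_models, key=lambda s: emd(test_sig, speaker_models[s]))
    return best if emd(test_sig, speaker_models[best]) <= threshold else None
```

Because the signatures are just weighted sets of codewords, no parametric model such as a GMM has to be fitted to the quantized data. In practice the threshold would have to be tuned on held-out data.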
