Automatic segmentation and identification of mixed-language speech using delta-BIC and LSA-based GMMs

This paper proposes an approach to segmenting and identifying mixed-language speech. A delta Bayesian information criterion (delta-BIC) is first applied to segment the input speech utterance into a sequence of language-dependent segments using acoustic features. A VQ-based bi-gram model is used to characterize the acoustic-phonetic dynamics of two consecutive codewords in a language, so that the language-specific acoustic-phonetic properties of phone sequences are integrated into the identification process. A Gaussian mixture model (GMM) is used to model codeword occurrence vectors, orthonormally transformed using latent semantic analysis (LSA), for each language-dependent segment. A filtering method is used to smooth the hypothesized language sequence and thus eliminate noise-like components of the detected language sequence generated by maximum likelihood estimation. Finally, a dynamic programming method is used to determine the language boundaries globally. Experimental results show that, for Mandarin, English, and Taiwanese, a recall rate of 0.87 was obtained for language boundary segmentation. Based on this recall rate, the proposed approach achieved language identification accuracies of 92.1% and 74.9% for single-language and mixed-language speech, respectively.
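The delta-BIC segmentation step can be illustrated with a minimal sketch. It uses the commonly cited formulation that compares one full-covariance Gaussian over an analysis window against two Gaussians split at a candidate frame. The abstract does not give the exact variant, window handling, or penalty setting used in the paper, so the function names (`logdet_cov`, `delta_bic`, `best_change_point`), the `penalty_weight` and `margin` parameters, and the synthetic features are all assumptions made for illustration.

```python
import numpy as np

def logdet_cov(x):
    """Log-determinant of the maximum-likelihood covariance of the rows of x."""
    d = x.shape[1]
    # A small ridge keeps the determinant finite for short segments (assumed value).
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True)) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return logdet

def delta_bic(x, t, penalty_weight=1.0):
    """Delta-BIC for splitting the window x (frames x dims) at frame t.

    Positive values favour two separate Gaussians, i.e. a change point at t.
    """
    n, d = x.shape
    # Penalty for the extra mean and full covariance of the second model.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(x)
            - 0.5 * t * logdet_cov(x[:t])
            - 0.5 * (n - t) * logdet_cov(x[t:])
            - penalty_weight * penalty)

def best_change_point(x, margin=20, penalty_weight=1.0):
    """Return (frame, score) for the best split, or (None, score) if none is positive."""
    n = x.shape[0]
    candidates = list(range(margin, n - margin))
    scores = [delta_bic(x, t, penalty_weight) for t in candidates]
    i = int(np.argmax(scores))
    if scores[i] > 0:
        return candidates[i], scores[i]
    return None, scores[i]

if __name__ == "__main__":
    # Synthetic stand-in for acoustic feature vectors: two regions with different means.
    rng = np.random.default_rng(0)
    first = rng.normal(0.0, 1.0, size=(200, 4))
    second = rng.normal(3.0, 1.0, size=(200, 4))
    frame, score = best_change_point(np.vstack([first, second]))
    print("hypothesized boundary frame:", frame, "delta-BIC:", score)
```

In this sketch a single window is searched for one boundary. A full segmenter would slide or grow the window over the utterance and, as the abstract describes, hand the resulting segments to the language identification and boundary refinement stages.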
