Paper information - Identifying non-linguistic speech features

Identifying non-linguistic speech features

SUMMARY

In this paper we have presented a unified approach for the identification of non-linguistic speech features from recorded signals using phone-based acoustic likelihoods. The inclusion of this technique in speech-based systems can broaden the scope of applications of speech technologies and lead to more user-friendly systems. The approach is based on training a set of large phone-based ergodic HMMs for each non-linguistic feature to be identified (language, gender, speaker, ...), and identifying the feature as that associated with the model having the highest acoustic likelihood of the set. The decoding procedure is efficiently implemented by processing all the models in parallel using a time-synchronous beam search strategy.

This has been shown to be a powerful technique for sex, language, and speaker identification, and has other possible applications such as dialect identification (including foreign accents) or identification of speech disfluencies. Sex identification for BREF and WSJ was error-free, and 99% accurate for TIMIT with 2 s of speech. Speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.1% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% if 2 utterances were used for identification. This identification accuracy was obtained on the 168 test speakers of TIMIT without making use of the phonetic transcriptions during training, verifying that it is not necessary to have labeled adaptation data. Speaker-independent models can be used to provide the labels used in building the speaker-specific models.
Being independent of the spoken text, and requiring only a small amount of identification speech (on the order of 2.5 s), this technique is promising for a variety of applications, particularly those for which continual, transparent verification is preferable.

Tests of two-way language identification of read, laboratory speech show that with 2 s of speech the language is correctly identified as English or French with over 99% accuracy. When the approach was simply ported to the conditions of telephone speech, identification accuracy on the French and English data in the OGI multi-language telephone speech corpus was about 76% with 2 s of speech, and increased to 82% with 10 s. The overall 10-language identification accuracy on the designated development test data in the OGI corpus is 59.7%. These results were obtained without the use of phone transcriptions for training, which were used for the experiments with laboratory speech.

In conclusion, we propose a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique has been shown to be effective for text-independent, vocabulary-independent sex, speaker, and language identification. While phone labels have been used to train the speaker-independent seed models, these models can then be used to label unknown speech, thus avoiding the costly process of transcribing the speech data. The ability to accurately identify non-linguistic speech features can lead to more performant spoken language systems, enabling better and more friendly human-machine interaction.
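The decision rule described in the summary (score the signal against one ergodic HMM per candidate feature value, advance all models in parallel frame by frame with beam pruning, then pick the model with the highest likelihood) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `identify`, the scalar Gaussian emissions, the two-state models, the uniform transitions, the Viterbi (best-path) approximation of the likelihood, and the beam width. The paper's models are large phone-based HMMs, so this shows only the shape of the computation.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (toy stand-in for an acoustic emission model)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def identify(frames, models, beam=50.0):
    """Score every model on the same frames, time-synchronously, and return
    (best_label, final_scores). States whose best-path log score falls more
    than `beam` below the best score over ALL models are pruned."""
    # Initialise each model with a uniform start distribution (assumption).
    active = {}
    for label, m in models.items():
        n = len(m["means"])
        active[label] = [
            math.log(1.0 / n) + log_gauss(frames[0], m["means"][s], m["vars"][s])
            for s in range(n)
        ]
    for x in frames[1:]:
        # One shared pruning threshold for all models at this time step.
        threshold = max(max(scores) for scores in active.values()) - beam
        updated = {}
        for label, scores in active.items():
            m = models[label]
            n = len(scores)
            kept = [s if s >= threshold else NEG_INF for s in scores]
            # Ergodic topology: every state may follow every other state.
            updated[label] = [
                max(kept[i] + m["log_trans"][i][j] for i in range(n))
                + log_gauss(x, m["means"][j], m["vars"][j])
                for j in range(n)
            ]
        active = updated
    final = {label: max(scores) for label, scores in active.items()}
    return max(final, key=final.get), final

# Hypothetical two-state models with uniform transitions and toy 1-D features.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
models = {
    "model_a": {"means": [0.0, 1.0], "vars": [1.0, 1.0], "log_trans": uniform},
    "model_b": {"means": [5.0, 6.0], "vars": [1.0, 1.0], "log_trans": uniform},
}
frames = [5.2, 5.9, 4.8, 6.1]
label, scores = identify(frames, models)
```

Because the pruning threshold is shared across models, a clearly mismatched model drops out early and costs nothing in later frames, which is the practical benefit of decoding all models in parallel rather than one after another.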

Jean-Luc Gauvain | Lori Lamel
