Identifying non-linguistic speech features

SUMMARY

In this paper we have presented a unified approach for the identification of non-linguistic speech features from recorded signals using phone-based acoustic likelihoods. The inclusion of this technique in speech-based systems can broaden the scope of applications of speech technologies and lead to more user-friendly systems. The approach is based on training a set of large phone-based ergodic HMMs for each non-linguistic feature to be identified (language, gender, speaker, ...), and identifying the feature as that associated with the model having the highest acoustic likelihood of the set. The decoding procedure is efficiently implemented by processing all the models in parallel using a time-synchronous beam search strategy.

This has been shown to be a powerful technique for sex, language, and speaker identification, and has other possible applications such as dialect identification (including foreign accents) or identification of speech disfluencies. Sex identification for BREF and WSJ was error-free, and 99% accurate for TIMIT with 2 s of speech. Speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.1% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% if 2 utterances were used for identification. This identification accuracy was obtained on the 168 test speakers of TIMIT without making use of the phonetic transcriptions during training, verifying that it is not necessary to have labeled adaptation data. Speaker-independent models can be used to provide the labels used in building the speaker-specific models.
Being independent of the spoken text, and requiring only a small amount of identification speech (on the order of 2.5 s), this technique is promising for a variety of applications, particularly those for which continual, transparent verification is preferable.

Tests of two-way language identification of read, laboratory speech show that with 2 s of speech the language is correctly identified as English or French with over 99% accuracy. When the approach was simply ported to the conditions of telephone speech, identification accuracy on the French and English data in the OGI multi-language telephone speech corpus was about 76% with 2 s of speech, and increased to 82% with 10 s. The overall 10-language identification accuracy on the designated development test data of the OGI corpus is 59.7%. These results were obtained without the use of phone transcriptions for training, which were used for the experiments with laboratory speech.

In conclusion, we propose a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique has been shown to be effective for text-independent, vocabulary-independent sex, speaker, and language identification. While phone labels have been used to train the speaker-independent seed models, these models can then be used to label unknown speech, thus avoiding the costly process of transcribing the speech data. The ability to accurately identify non-linguistic speech features can lead to higher-performance spoken language systems, enabling better and friendlier human-machine interaction.
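
The decision rule summarized above (one ergodic HMM per candidate feature value, with the value chosen as the one whose model gives the highest acoustic likelihood) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the two-state toy models, the labels "A" and "B", the unit-variance diagonal Gaussians, and the uniform transition probabilities are not taken from the paper. For brevity the sketch scores each model with the exact forward algorithm rather than the parallel time-synchronous beam search the paper describes.

```python
import numpy as np

def log_gauss(frames, means, variances):
    """Log density of every frame under every state's diagonal Gaussian.
    frames: (T, D); means, variances: (S, D). Returns an array of shape (T, S)."""
    diff = frames[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2)

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def log_likelihood(frames, model):
    """Forward algorithm in the log domain: log p(frames | model)."""
    log_b = log_gauss(frames, model["means"], model["variances"])
    log_a = np.log(model["trans"])
    alpha = np.log(model["init"]) + log_b[0]
    for t in range(1, len(frames)):
        # Sum over predecessor states, then add the emission score of frame t.
        alpha = logsumexp(alpha[:, None] + log_a, axis=0) + log_b[t]
    return logsumexp(alpha, axis=0)

def ergodic_model(means):
    """Hypothetical toy ergodic HMM: every state reaches every state (uniform)."""
    means = np.asarray(means, dtype=float)
    s = len(means)
    return {
        "init": np.full(s, 1.0 / s),
        "trans": np.full((s, s), 1.0 / s),
        "means": means,
        "variances": np.ones_like(means),
    }

def identify(frames, models):
    """Return the label whose model has the highest log-likelihood, plus all scores."""
    scores = {label: log_likelihood(frames, m) for label, m in models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical feature values, each with its own ergodic model.
models = {
    "A": ergodic_model([[0.0, 0.0], [1.0, 1.0]]),
    "B": ergodic_model([[4.0, 4.0], [5.0, 5.0]]),
}
frames = np.array([[0.1, -0.2], [0.9, 1.1], [0.2, 0.0]])
label, scores = identify(frames, models)
print(label, scores)
```

In a real system the per-model states would correspond to phone models trained on speech from the language, sex, or speaker in question, and the beam search would prune unlikely paths across all models at each frame instead of computing each likelihood exhaustively.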
