Transcription of Russian conversational speech

This paper presents initial work on transcribing conversational telephone speech in Russian. Acoustic seed models were derived from other languages. The initial studies, carried out with 9 hours of transcribed data, explore the choice of phone set and the use of other data types to improve transcription performance. Discriminant features produced by a Multi-Layer Perceptron trained on a few hours of Russian conversational data are contrasted with features derived from well-trained networks for English telephone speech and from Russian broadcast data. Acoustic models trained on broadcast data filtered to match the telephone band achieve results comparable to those obtained with models trained on the small conversational telephone speech corpus.
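
To illustrate the idea of reusing broadcast data for a telephone-speech system, the sketch below band-limits wideband broadcast audio to the telephone channel before acoustic model training. This is a minimal illustration, not the paper's actual pipeline; the file names, the 8 kHz target rate, and the 300-3400 Hz passband are assumptions chosen to match typical telephone bandwidth.

```python
# Minimal sketch: downsample and band-pass wideband broadcast audio so its
# bandwidth roughly matches conversational telephone speech (assumptions noted above).
import soundfile as sf
from scipy.signal import butter, sosfiltfilt, resample_poly

def to_telephone_band(in_path, out_path, target_sr=8000):
    audio, sr = sf.read(in_path)                    # wideband broadcast recording
    if sr != target_sr:                             # resample to the 8 kHz telephone rate
        audio = resample_poly(audio, target_sr, sr)
        sr = target_sr
    # Band-pass to an approximate 300-3400 Hz telephone channel
    sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    filtered = sosfiltfilt(sos, audio)
    sf.write(out_path, filtered, sr)

# Example usage (hypothetical file names):
# to_telephone_band("broadcast_16k.wav", "broadcast_tel_8k.wav")
```

The filtered broadcast recordings can then be fed to the same feature extraction and training setup as the telephone corpus, which is the sense in which the two data sources become comparable.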
