Paper information - Transcription of Arabic broadcast news

Transcription of Arabic broadcast news

This paper describes recent research on transcribing Modern Standard Arabic broadcast news data. The Arabic language presents a number of challenges for speech recognition, arising in part from the significant differences between the spoken and written forms, in particular the convention that texts are written in non-vowelized form. Arabic is a highly inflected language where articles and affixes are added to roots in order to change the word’s meaning. A corpus of 50 hours of audio data from 7 television and radio sources and 200 M words of newspaper texts were used to train the acoustic and language models. The transcription system based on these models and a vowelized dictionary obtains an average word error rate of about 18% on a test set comprising 12 hours of data from 8 sources.

Jean-Luc Gauvain | Lori Lamel | Abdelkhalek Messaoudi
