Unsupervised Training on a Large Amount of Arabic Broadcast News Data

The unsupervised training we carried out on 1,858 hours of untranscribed Arabic broadcast news (BN) data yields a sizable gain. However, this gain is only about half of that achieved on 1,900 hours of English BN data. This paper presents our efforts to enlarge the gain on the Arabic data. These efforts include the design of an explicit hypothesis-confidence estimation method for data selection, the use of new features and neural networks (NN) to improve hypothesis-confidence estimation, and the alleviation of the over-fitting problem in the estimation. Our experiments show that both the explicit hypothesis-confidence estimation method and the new features improve the estimation and give the unsupervised training additional gains; the use of neural networks does not significantly improve the confidence estimation; and the alleviation of the over-fitting problem is not significant enough to decrease the word error rate (WER). This paper also presents improvements of unsupervised training that we conducted on a morpheme-based Arabic system and on models trained with the maximum mutual information (MMI) criterion.
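
As a rough illustration of confidence-based data selection for unsupervised training, the sketch below keeps only those automatically transcribed utterances whose estimated confidence clears a threshold. It is not the estimation method of this paper: the `Hypothesis` record, the use of a plain mean of word confidences as the utterance score, and the threshold of 0.8 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One recognizer output for an untranscribed utterance (hypothetical record)."""
    utt_id: str
    words: List[str]
    word_confidences: List[float]  # one score in [0, 1] per word
    duration_sec: float


def utterance_confidence(hyp: Hypothesis) -> float:
    """Assumed utterance score: the mean of the word-level confidences."""
    if not hyp.word_confidences:
        return 0.0  # nothing recognized, so treat as unusable
    return sum(hyp.word_confidences) / len(hyp.word_confidences)


def select_for_training(hyps: List[Hypothesis], threshold: float = 0.8) -> List[Hypothesis]:
    """Keep hypotheses whose utterance score is at or above the threshold."""
    return [h for h in hyps if utterance_confidence(h) >= threshold]


def selected_hours(hyps: List[Hypothesis]) -> float:
    """Total audio duration of the given hypotheses, in hours."""
    return sum(h.duration_sec for h in hyps) / 3600.0
```

Raising the threshold trades data quantity for transcript quality: fewer hours survive selection, but the retained automatic transcripts are expected to contain fewer recognition errors.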
