MODELING ARABIC LANGUAGE USING STATISTICAL METHODS

In this paper we investigate statistical language models for Arabic. First, several experiments with different smoothing techniques are carried out on a small corpus extracted from a daily newspaper. The sparseness of the data leads us to look for solutions that do not require enlarging the corpus: a word segmentation technique is applied to make the corpus statistics more reliable, and the resulting n-morpheme model achieves better performance in terms of normalized perplexity. The second set of experiments studies the behaviour of statistical models trained on different kinds of corpora; introducing distant n-grams improves on the baseline model. Finally, we present a comparative study of statistical language models for Arabic and several foreign languages, with the aim of understanding how best to model each of these languages. For the foreign languages, trigram models are the most appropriate whatever the smoothing technique used, while for Arabic, higher-order n-gram models smoothed with the Witten-Bell method are more effective.
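As a rough illustration of the first experiment, the sketch below trains trigram models with Witten-Bell smoothing on a word-level and a morpheme-level tokenization of the same toy corpus and compares their perplexities on held-out text. It is not the authors' code: it uses NLTK's nltk.lm package rather than their toolkit, and the transliterated sentences and the "al+" segmentation are invented placeholders, not data from the paper.

```python
# Minimal sketch: Witten-Bell smoothed trigram LMs on word vs. morpheme tokens.
from nltk.lm import WittenBellInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3  # trigram; the paper also explores higher orders for Arabic

# Toy transliterated data; the real experiments use a newspaper corpus.
train_words = [["ktb", "alwld", "aldrs"], ["qra", "alwld", "alktab"]]
# Hypothetical morpheme segmentation (the prefix "al+" split off).
train_morphs = [["ktb", "al+", "wld", "al+", "drs"],
                ["qra", "al+", "wld", "al+", "ktab"]]
held_out_words = ["ktb", "alwld", "alktab"]
held_out_morphs = ["ktb", "al+", "wld", "al+", "ktab"]

def perplexity(train_sents, test_sent):
    """Fit a Witten-Bell interpolated n-gram LM and score one test sentence."""
    train_ngrams, vocab = padded_everygram_pipeline(ORDER, train_sents)
    lm = WittenBellInterpolated(ORDER)
    lm.fit(train_ngrams, vocab)
    test_ngrams = list(ngrams(pad_both_ends(test_sent, n=ORDER), ORDER))
    return lm.perplexity(test_ngrams)

print("word-level PP:    ", perplexity(train_words, held_out_words))
print("morpheme-level PP:", perplexity(train_morphs, held_out_morphs))
# Note: perplexities over different token inventories are not directly
# comparable; the paper reports a normalized perplexity for this reason.
```

The design point the sketch mirrors is that segmenting Arabic words into morphemes shrinks the vocabulary and increases the number of observations per unit, which counteracts data sparseness on a small corpus.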
