A Novel Algorithm for Normalizing Noisy Arabic Text

This paper introduces an algorithm for normalizing noisy text that focuses exclusively on Arabic. Although many approaches to Arabic text processing have been proposed, none so far has focused on noisy Arabic text. The paper also introduces a new similarity measure for stemming noisy Arabic documents. Such a measure is needed because the rules commonly applied in stemming cannot be applied to noisy text, which does not conform to known grammatical rules and contains various spelling mistakes. The proposed normalization algorithm therefore groups words automatically after applying the similarity measure. To validate the algorithm, the new normalization technique is evaluated by its reduction of under-stemming errors, using the evaluation method introduced by Paice.
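The two ideas in the abstract, grouping words by a similarity measure and scoring the result by under-stemming errors, can be illustrated with a minimal sketch. The abstract does not define the proposed similarity measure, so the code below substitutes the ratio from Python's `difflib.SequenceMatcher` purely as a stand-in. The function names, the greedy grouping strategy, the threshold of 0.7, and the transliterated sample words are all hypothetical. The index follows Paice's error-counting definition: unachieved merges divided by desired merges over manually defined concept groups.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the measure proposed in the paper."""
    return SequenceMatcher(None, a, b).ratio()


def group_words(words, threshold=0.7):
    """Greedy grouping (an assumption): attach each word to the first group
    whose first member is similar enough, otherwise open a new group."""
    groups = []
    for word in words:
        for group in groups:
            if similarity(word, group[0]) >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


def understemming_index(concept_groups, label_of):
    """Paice's under-stemming index: unachieved merges / desired merges.

    concept_groups: lists of words that SHOULD be conflated together.
    label_of: maps each word to the group label the normalizer assigned.
    """
    unachieved = desired = 0.0
    for group in concept_groups:
        n = len(group)
        desired += 0.5 * n * (n - 1)
        counts = Counter(label_of[w] for w in group)
        unachieved += 0.5 * sum(u * (n - u) for u in counts.values())
    return unachieved / desired if desired else 0.0


# Hypothetical transliterated variants, used only for illustration.
words = ["kitab", "kitaab", "madrasa"]
groups = group_words(words)
label_of = {w: i for i, g in enumerate(groups) for w in g}
print(groups)
print(understemming_index([["kitab", "kitaab"]], label_of))
```

A lower index means fewer words that belong together were left in separate groups. A real evaluation would use the paper's own measure and a hand-built set of concept groups rather than these toy inputs.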

[1] Chris D. Paice. Method for Evaluation of Stemming Algorithms Based on Error Counting, 1996, J. Am. Soc. Inf. Sci.

[2] Chris D. Paice. An evaluation method for stemming algorithms, 1994, SIGIR '94.

[3] Ophir Frieder, et al. On Arabic search: improving the retrieval effectiveness via a light stemming approach, 2002, CIKM '02.

[4] Alexander M. Fraser, et al. Empirical studies in strategies for Arabic retrieval, 2002, SIGIR '02.

[5] Leah S. Larkey, et al. Arabic Information Retrieval at UMass in TREC-10, 2001, TREC.

[6] Martha W. Evens, et al. Comparing Words, Stems, and Roots as Index Terms in an Arabic Information Retrieval System, 1994, J. Am. Soc. Inf. Sci.

[7] Ravikumar Kondadadi, et al. A word-based soft clustering algorithm for documents, 2001, Computers and Their Applications.

[8] Ibrahim A. Al-Kharashi. Micro-AIRS: a microcomputer-based Arabic information retrieval system comparing words, stems, and roots as index terms, 1992.

[9] Lisa Ballesteros, et al. Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis, 2002, SIGIR '02.

[10] Jessica Lin, et al. A novel Arabic lemmatization algorithm, 2008, AND '08.

[11] Martin Porter, et al. Snowball: A language for stemming algorithms, 2001.

[12] Ricardo A. Baeza-Yates, et al. Text-Retrieval: Theory and Practice, 1992, IFIP Congress.

[13] Amna A. Al Kaabi, et al. Arabic Light Stemmer: A New Enhanced Approach, 2005.

[14] C. Huyck, et al. A stemming algorithm for the Portuguese language, 2001, Proceedings Eighth Symposium on String Processing and Information Retrieval.

[15] Ophir Frieder, et al. On Arabic search: the effectiveness of monolingual and bidirectional information retrieval, 2002.

[16] Gerard Salton, et al. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, 1989.