Arabic Text Data Mining: a Root-Based Hierarchical Indexing Model

Abstract The world has recently witnessed tremendous growth in the volume of text documents available on the Internet, in digital libraries, news sources, and company-wide intranets. Text data mining is a multidisciplinary field involving information retrieval, text analysis, information extraction, clustering, categorization, linguistics, database technology, machine learning, and data mining, and it is becoming more significant. Research efforts have intensified in areas such as information retrieval, whose practical applications are increasingly necessary to end users and to the scientific community itself in order to fetch the growing body of available information efficiently. In the past few years, not only have new documents been produced directly in digital form, making them suitable for automatic indexing, but many older documents have also been ported from their physical medium to a digital one. The meaning of a document is represented by a vector of features, which are weighted according to the measure that best estimates relevance. Text categorization presents unique challenges due to the large number of attributes present in the data set, the large number of training samples, and attribute dependencies. This article focuses on speeding up the information retrieval process in an Arabic document base by using a root-based hierarchical indexing model. Simulation results demonstrate that a speed gain in the range of 50-100 can be achieved for typical queries.
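To make the idea of a root-based hierarchical index concrete, the following is a minimal sketch and not the model evaluated in the article. It assumes a two-level structure (root, then surface word form, then document identifiers). The names `TOY_ROOTS`, `extract_root`, and `RootIndex` are hypothetical, and the lookup table stands in for a real Arabic morphological analyzer.

```python
from collections import defaultdict

# Hypothetical toy table mapping surface forms to their triliteral roots.
# A real system would use a morphological analyzer instead of a fixed table.
TOY_ROOTS = {
    "كتاب": "كتب",   # book    -> k-t-b
    "مكتبة": "كتب",  # library -> k-t-b
    "كاتب": "كتب",   # writer  -> k-t-b
    "مدرسة": "درس",  # school  -> d-r-s
    "دارس": "درس",   # student -> d-r-s
}


def extract_root(word):
    """Return the root of a word, or the word itself if it is unknown."""
    return TOY_ROOTS.get(word, word)


class RootIndex:
    """Two-level index (assumed layout): root -> word form -> document ids."""

    def __init__(self):
        self.tree = defaultdict(lambda: defaultdict(set))

    def add_document(self, doc_id, text):
        # Tokenization is plain whitespace splitting, for illustration only.
        for word in text.split():
            self.tree[extract_root(word)][word].add(doc_id)

    def search_word(self, word):
        """Documents containing exactly this word form."""
        forms = self.tree.get(extract_root(word), {})
        return set(forms.get(word, set()))

    def search_root(self, word):
        """Documents containing any word form that shares the word's root."""
        forms = self.tree.get(extract_root(word), {})
        result = set()
        for doc_ids in forms.values():
            result |= doc_ids
        return result
```

In this sketch a query first descends to the branch for its root, so only the word forms under that root are inspected rather than the whole vocabulary. Whether this matches the source of the speed gain reported above is an assumption.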
