Graph-Based Text Modeling: Considering Mathematical Semantic Linking to Improve the Indexation of Arabic Documents

Indexing unstructured documents aims to build a list of words, or concepts, that simplifies their later exploration. The most widely used model for text modeling is the Vector Space Model. Despite its simple implementation and its wide use in text mining and information retrieval research, this model has an important limitation: it treats textual units as independent and therefore ignores the semantic relations between them. Graph-based representation, a technique from data mining, is better suited to highlighting the semantic links between text units. A graph adapts easily to textual data by representing words as vertices and the relations between them as edges. In this work, we introduce graph-based modeling of textual documents and study how the choice of semantic relation between text units affects document indexation. We validate our findings through classification results.
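
As a rough illustration of the graph-based representation described above, the following minimal sketch builds a word graph in which vertices are words and weighted edges link words that appear near each other. The co-occurrence window is an assumption made here for illustration only; the abstract does not specify which semantic relations the study compares. The names `build_word_graph`, `window`, and the sample sentence are hypothetical.

```python
from collections import defaultdict


def build_word_graph(tokens, window=2):
    """Return an undirected weighted graph as {word: {neighbor: weight}}.

    Assumption: two words are related when they co-occur within `window`
    consecutive tokens. This stands in for whatever semantic relation
    is actually chosen.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        # Link the word to the following words that fall inside the window.
        for neighbor in tokens[i + 1 : i + window]:
            if neighbor != word:
                graph[word][neighbor] += 1
                graph[neighbor][word] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}


# Hypothetical, already tokenized sample text.
tokens = "graph models link words and graph models weight words".split()
graph = build_word_graph(tokens, window=2)

# One simple way to derive index terms: rank words by weighted degree.
ranked = sorted(graph, key=lambda w: sum(graph[w].values()), reverse=True)
```

Unlike a Vector Space Model vector, this structure keeps track of which words are linked, so a term's importance can depend on its neighbors. Ranking by weighted degree is only one possible choice; an iterative ranking over the same graph would be another.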
