Word Sense Representation-Based Method for Arabic Text Categorization

Word embedding and document embedding representations have shown good performance in several natural language processing tasks, such as information retrieval and text categorization, because they capture both the syntactic and semantic information contained in text. However, ambiguity is a ubiquitous problem in the Arabic language, one that can be handled with sense embedding approaches. Learning a distinct representation for each sense of an ambiguous word could lead to more powerful and fine-grained vector-space models. In this paper, we propose a method that combines document embedding and word sense disambiguation to enhance Arabic text representation. First, we benefit from sense embedding. Second, we enrich the document embedding model by learning representations for both words and their senses and by increasing the number of documents used to train the model. We evaluate the proposed method in the context of Arabic text categorization. Our system is composed of four stages: (1) preprocessing, (2) word sense disambiguation, (3) document embedding, and (4) document categorization. We conducted several experiments on the OSAC corpus; the results show that our method outperforms state-of-the-art methods.
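To make the four-stage pipeline concrete, the following is a minimal sketch in Python and not the implementation evaluated in this paper. It rests on several assumptions: the sense inventory `SENSES`, the training pairs `TRAIN`, and all function names are hypothetical; disambiguation uses a simplified Lesk-style context overlap; and a sparse count vector over words and sense tags stands in for a learned document embedding, with a nearest-centroid rule standing in for the classifier.

```python
import math
import re
from collections import Counter

# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640), removed in stage 1.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

# Hypothetical toy sense inventory: ambiguous word -> {sense tag: signature words}.
SENSES = {
    "عين": {
        "عين#eye": {"الطبيب", "المريض", "النظر"},
        "عين#spring": {"ماء", "الجبل", "النهر"},
    }
}

def preprocess(text):
    """Stage 1: strip diacritics and tokenize on whitespace."""
    return DIACRITICS.sub("", text).split()

def disambiguate(tokens):
    """Stage 2: replace each ambiguous token with the sense tag whose
    signature overlaps the document context most (ties: first listed sense)."""
    context = set(tokens)
    tagged = []
    for tok in tokens:
        senses = SENSES.get(tok)
        if senses:
            tok = max(senses, key=lambda s: len(senses[s] & context))
        tagged.append(tok)
    return tagged

def embed(tagged):
    """Stage 3 (stand-in): sparse count vector over words and sense tags.
    A learned document embedding model would replace this step."""
    return Counter(tagged)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(labeled_docs):
    """Accumulate the vectors of each category into one centroid."""
    centroids = {}
    for text, label in labeled_docs:
        vec = embed(disambiguate(preprocess(text)))
        centroids.setdefault(label, Counter()).update(vec)
    return centroids

def categorize(text, centroids):
    """Stage 4 (stand-in): assign the category with the most similar centroid."""
    vec = embed(disambiguate(preprocess(text)))
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Hypothetical two-document training set.
TRAIN = [
    ("فحص الطبيب عين المريض", "health"),
    ("ماء عين الجبل بارد", "geography"),
]
centroids = train(TRAIN)
label = categorize("شرب الراعي من ماء عين", centroids)
```

In this toy setting the ambiguous word عين is tagged differently in the two training documents, so it no longer contributes to the similarity between a text about a water spring and the health category; this is the effect that sense-level representations are intended to provide.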