Paper information - Theme Classification of Arabic Text: A Statistical Approach

The volume of textual documents stored across many domains continues to grow rapidly, so it must be organized in a way that lets users access it easily. Text-mining tools help process this growing body of data and reveal the important information embedded in those documents. However, information retrieval for the Arabic language is a relatively new field, and the research devoted to it is limited compared with the work done on other languages (e.g., English, Greek, German, Chinese). In this paper, we propose two statistical approaches to text classification by theme that are dedicated to the Arabic language. The evaluation tests are conducted on an Arabic textual corpus covering five themes: Economics, Politics, Sport, Medicine, and Religion. This investigation has validated several text-mining tools for the Arabic language and has shown that the two proposed approaches are of interest for Arabic theme classification, with classification performance reaching 95%.