The Automatic Categorization of Arabic Documents by Boosting Decision Trees

Automatic document classification has been a subject of research since the early 1960s. Further research is still needed, however, because the results obtained so far leave room for improvement. Although much has been written on the subject, very little research has been reported on the automatic classification of Arabic documents, and none of it has applied the technique of Boosting. In addition, Arabic is a highly inflected language and is morphologically much more complex than languages written with Latin characters. One cannot, therefore, take for granted that Boosting is as effective for classifying Arabic documents as it is for documents written in Latin characters. This paper explores the technique of Boosting and its effectiveness in the automatic classification of Arabic documents, and compares its performance with the results obtained with Support Vector Machines and Naïve Bayesian Networks.
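To make the idea of Boosting concrete, the following is a minimal sketch of AdaBoost over decision stumps (one-level decision trees) on word-presence features. It is an illustration under stated assumptions, not the paper's implementation: the function names `train_adaboost` and `predict`, the toy documents, and the Latin-script placeholder tokens are all hypothetical, and the preprocessing a real Arabic corpus would need is not shown.

```python
import math

def stump_output(token, polarity, doc):
    """A decision stump: vote `polarity` if the token occurs, else the opposite."""
    return polarity if token in doc else -polarity

def train_adaboost(docs, labels, rounds=3):
    """Train AdaBoost on docs (sets of tokens) with labels in {-1, +1}.

    Returns a list of (alpha, token, polarity) weighted stumps.
    """
    n = len(docs)
    weights = [1.0 / n] * n            # start with uniform example weights
    vocab = sorted(set().union(*docs))
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for token in vocab:
            for polarity in (1, -1):
                err = sum(w for w, d, y in zip(weights, docs, labels)
                          if stump_output(token, polarity, d) != y)
                if best is None or err < best[0]:
                    best = (err, token, polarity)
        err, token, polarity = best
        if err >= 0.5:                 # no stump beats chance: stop early
            break
        err = max(err, 1e-10)          # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, token, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_output(token, polarity, d))
                   for w, d, y in zip(weights, docs, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, doc):
    """Classify by the sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_output(token, polarity, doc)
                for alpha, token, polarity in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: a document is labelled +1 only if it contains
# at least two of the three tokens, so no single stump classifies it all.
docs = [{"goal", "match"}, {"goal", "team"}, {"match", "team"},
        {"goal"}, {"match"}, {"team"}]
labels = [1, 1, 1, -1, -1, -1]

model = train_adaboost(docs, labels, rounds=3)
print([predict(model, d) for d in docs])
```

On this toy set a single stump misclassifies two of the six documents, while three boosted stumps classify all six training documents correctly. That is the sense in which Boosting combines weak classifiers into a stronger one.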
