A Comparison of Text Classification Methods: Weighted Terms Selected by Different Stemming Techniques

In information retrieval, three factors have an important impact on system performance: the stemming algorithm, the feature extraction method, and the classification tool. In this work, we compare three well-known stemming techniques: the Lovins stemmer, the iterated Lovins stemmer, and the Snowball stemmer. For the classification phase, we experimentally compare five methods: BNET, NBMU, RF, SLogicF, and SVM. Based on these classifiers, we propose a new retrieval system that combines them through a voting method to improve on the performance of the classical systems. Features are extracted with the TFIDF algorithm. The proposed systems are tested on two datasets: BBCNEWS and BBCSPORT. The systems based on the Lovins stemmers and on the voting technique give the best results. For the first dataset, the best accuracy is obtained by the Lovins + Vote system, with a recognition rate of about 97%. For the second dataset, the Snowball + Vote system gives a recognition rate of 99%.
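The two generic building blocks named above, TFIDF term weighting and classifier voting, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which TFIDF variant or voting rule was used, so the sketch assumes length-normalized term frequency multiplied by log(N/df), and a simple plurality vote over class labels. The function names `tfidf` and `majority_vote` are hypothetical, and the input is assumed to be already tokenized and stemmed.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight terms by TF-IDF (assumed variant: tf/len * log(N/df)).

    docs: list of token lists, assumed already stemmed.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    predictions: list of labels, one per base classifier.
    Ties go to the label that appears first in the list
    (an assumption; the abstract does not specify a tie rule).
    """
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy, hand-made token lists; not taken from BBCNEWS or BBCSPORT.
    docs = [
        ["match", "goal", "goal", "news"],
        ["market", "stock", "news"],
        ["goal", "market", "news"],
    ]
    for doc_weights in tfidf(docs):
        print(doc_weights)

    # Hypothetical outputs of five base classifiers for one document.
    print(majority_vote(["sport", "business", "sport", "sport", "tech"]))
```

In this sketch a term that occurs in every document (here `news`) receives weight zero, which shows why TFIDF favours terms that discriminate between classes. The vote only needs the predicted labels, so it is independent of which base classifiers produce them.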
