A Novel Term Weighting Scheme and an Approach for Classification of Agricultural Arabic Text Complaints

In this paper, a machine learning based approach is proposed for classifying farmers' complaints, written in Arabic text, into different crop classes. Initially, the complaints are preprocessed using stop word removal, automatic correction of words, and stemming to extract only the content terms; domain-specific special cases that may affect classification performance are also handled at this stage. A new term weighting scheme called Term Class Weight-Inverse Class Frequency (TCW-ICF) is then used to extract the most discriminating features with respect to each class. The extracted features are used to represent the preprocessed complaints as feature vectors for training a classifier. Finally, an unlabeled complaint is assigned to one of the crop classes by the trained classifier. In addition, a relatively large dataset has been created, consisting of more than 5000 farmers' complaints written in Arabic script and covering eight different crops. The proposed approach has been validated through extensive experiments on the newly created dataset using a KNN classifier. The results indicate that the proposed approach outperforms the baseline Vector Space Model (VSM). Further, the superiority of the proposed term weighting scheme in selecting the best set of discriminating features has been demonstrated through a comparative analysis against four well-known feature selection techniques. The new term weighting scheme is applied to Arabic script as a case study, but it can be applied to text data in any language.
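The pipeline described above (class-aware term weighting, per-class feature selection, vector representation, KNN classification) can be illustrated with a minimal sketch. The abstract does not give the TCW-ICF formula, so the weight used here is a stand-in (relative term frequency within a class multiplied by a log-scaled inverse class frequency) and should not be read as the authors' definition. All function names, the cosine similarity measure, and the toy tokens are assumptions made for illustration; real input would be preprocessed Arabic stems.

```python
import math
from collections import Counter, defaultdict


def class_term_weights(docs, labels):
    """Stand-in class-aware weight (NOT the paper's TCW-ICF formula):
    relative frequency of a term inside a class, multiplied by a
    log-scaled inverse of the number of classes containing the term."""
    class_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        class_counts[label].update(tokens)
    n_classes = len(class_counts)
    # Number of classes in which each term occurs at least once.
    class_freq = Counter()
    for counts in class_counts.values():
        class_freq.update(counts.keys())
    weights = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())
        weights[label] = {
            term: (count / total) * math.log(1 + n_classes / class_freq[term])
            for term, count in counts.items()
        }
    return weights


def select_features(weights, top_k):
    """Keep the top_k highest-weighted terms of every class (ties broken alphabetically)."""
    selected = set()
    for per_class in weights.values():
        ranked = sorted(per_class, key=lambda t: (-per_class[t], t))
        selected.update(ranked[:top_k])
    return sorted(selected)


def vectorize(tokens, features):
    """Represent a tokenized complaint as term counts over the selected features."""
    counts = Counter(tokens)
    return [counts[f] for f in features]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: -cosine(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, already-preprocessed complaints (placeholder tokens, not real data).
train_docs = [
    ["wheat", "rust", "leaf"],
    ["wheat", "yellow", "leaf"],
    ["tomato", "fruit", "rot"],
    ["tomato", "leaf", "curl"],
]
train_labels = ["wheat", "wheat", "tomato", "tomato"]

features = select_features(class_term_weights(train_docs, train_labels), top_k=2)
train_vectors = [vectorize(d, features) for d in train_docs]
prediction = knn_predict(vectorize(["wheat", "leaf", "spot"], features),
                         train_vectors, train_labels, k=3)
print(features, prediction)
```

In this sketch a term shared by several classes (here `leaf`) receives a lower weight than a term confined to one class, which is the general intuition behind inverse class frequency; how TCW-ICF itself balances these factors is defined in the paper, not here.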
