Text Classification of News Articles Using Machine Learning on Low-resourced Language: Tigrigna

Text categorization is an increasingly important method for assigning a textual document to its most relevant label. However, textual resources have not grown equally across languages; for Tigrigna, a low-resourced language with no freely available dataset, text categorization is an interesting problem. Our aim is to assign a given document to its category based on its linguistic features. To achieve this goal, we constructed a new dataset from different Tigrigna news sources. The dataset has six main categories: Agriculture, Sports, Health, Education, Religion, and Politics. Each collected article is preprocessed to remove Latin characters, punctuation, and stop words. We applied a collection of classical machine learning classifiers to investigate their effectiveness on our dataset. Specifically, seven popular classifiers were used: Logistic Regression, Nearest Centroid, Decision Tree (DT), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forest, and Multi-Layer Perceptron (MLP). Ensemble models were also implemented to improve accuracy by combining the best classifiers through majority voting. Our experimental results showed reliable performance, with a minimum F1-score of 89.1% achieved by Nearest Centroid and a top performance of 96% achieved by SVM. The experimental results are presented in terms of precision, recall, and F1-score.
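The preprocessing and majority-voting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list is a small hypothetical placeholder, the choice to keep only Ethiopic letters (U+1200 to U+135A) is an assumption about how Latin characters and punctuation might be removed, and the function names `preprocess` and `majority_vote` are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, illustrative stop-word list; the actual list used in the
# paper is not given in the abstract.
STOP_WORDS = {"ናይ", "ኣብ", "እዩ"}

# Assumption: keep only Ethiopic letters (U+1200-U+135A) and whitespace.
# This drops Latin characters, digits, and both ASCII and Ethiopic
# punctuation (U+1361-U+1368).
NON_ETHIOPIC = re.compile(r"[^\u1200-\u135A\s]")


def preprocess(text):
    """Strip non-Ethiopic characters, then remove stop words."""
    cleaned = NON_ETHIOPIC.sub(" ", text)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties are resolved in favor of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]


# Usage: clean one article, then combine three classifiers' predictions.
tokens = preprocess("ናይ ስፖርት news 2020። ዜና")
label = majority_vote(["Sports", "Health", "Sports"])
```

In practice the cleaned tokens would be turned into feature vectors before being passed to the classifiers; that step is omitted here because the abstract does not specify the feature representation.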
