Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms

Abstract: Multi-label text categorization is the problem of assigning each document to a subset of categories by means of multi-label learning algorithms. Unlike English and most other languages, Arabic lacks benchmark datasets, which has prevented the evaluation of multi-label learning algorithms for Arabic text categorization. As a result, only a few recent studies have addressed multi-label Arabic text categorization, and they relied on non-benchmark, inaccessible datasets. This work therefore aims to promote multi-label Arabic text categorization by (a) introducing “RTAnews”, a new benchmark dataset of multi-label Arabic news articles for text categorization and other supervised learning tasks, which is publicly available in several formats compatible with existing multi-label learning tools such as MEKA and Mulan; and (b) conducting an extensive comparison of most of the well-known multi-label learning algorithms on RTAnews, in order to establish baseline results and show the effectiveness of these algorithms for Arabic text categorization. The evaluation involves four transformation-based algorithms (Binary Relevance, Classifier Chains, Calibrated Ranking by Pairwise Comparison and Label Powerset), combined with three base learners (Support Vector Machine, k-Nearest-Neighbors and Random Forest), and four adaptation-based algorithms (Multi-label kNN, Instance-Based Learning by Logistic Regression Multi-label, Binary Relevance kNN and RFBoost). The reported baseline results show that RFBoost and Label Powerset with Support Vector Machine as base learner outperformed the other compared algorithms. The results also show that adaptation-based algorithms are faster than transformation-based algorithms.
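To make the idea of a transformation-based algorithm concrete, the following minimal sketch shows how two of the named methods, Binary Relevance and Label Powerset, recast a multi-label dataset as single-label problems that an ordinary base learner could then be trained on. The toy labels and documents, and the function names `binary_relevance_targets` and `label_powerset_targets`, are hypothetical illustrations and are not taken from RTAnews, MEKA or Mulan.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy data: the label set of each document (not from RTAnews).
LABELS: List[str] = ["politics", "economy", "sports"]
DOC_LABELS: List[Set[str]] = [
    {"politics", "economy"},
    {"sports"},
    {"politics"},
    {"politics", "economy"},
]


def binary_relevance_targets(
    doc_labels: List[Set[str]], labels: List[str]
) -> Dict[str, List[int]]:
    """Binary Relevance: one independent binary target vector per label."""
    return {
        label: [1 if label in doc else 0 for doc in doc_labels]
        for label in labels
    }


def label_powerset_targets(
    doc_labels: List[Set[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Label Powerset: each distinct label subset becomes one class id."""
    class_ids: Dict[FrozenSet[str], int] = {}
    targets: List[int] = []
    for doc in doc_labels:
        key = frozenset(doc)
        if key not in class_ids:
            # Assign ids in order of first appearance.
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids


br = binary_relevance_targets(DOC_LABELS, LABELS)
lp, lp_classes = label_powerset_targets(DOC_LABELS)
```

Under Binary Relevance, a base learner is trained once per label on the corresponding binary vector in `br`; under Label Powerset, a single multi-class base learner is trained on `lp`, so label co-occurrence is preserved at the cost of one class per observed label subset.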
