Paper Information - SANAD: Single-label Arabic News Articles Dataset for automatic text categorization

Text classification (also known as text categorization) is one of the most popular Natural Language Processing (NLP) tasks and has been an active research topic in recent years. However, the task has received much less attention for Arabic, owing to the lack of rich, representative resources for training an Arabic text classifier. We therefore introduce SANAD, a large Single-labeled Arabic News Articles Dataset of textual data collected from three news portals: AlKhaleej, AlArabiya, and Akhbarona. SANAD is composed of three main datasets scraped from these portals and consists of almost 200k articles distributed across seven categories, which we offer to the research community working on Arabic computational linguistics. We anticipate that this dataset will be a valuable aid for a variety of NLP tasks on Modern Standard Arabic (MSA) text, especially single-label text classification. The data are presented in raw form. SANAD is public and freely available at https://data.mendeley.com/datasets/57zpx667y9.
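Because the data are distributed in raw form, a reader will typically need a small loader before training a classifier. The following is a minimal sketch only, assuming a hypothetical on-disk layout in which each portal's dataset is a directory containing one subdirectory per category, with one UTF-8 `.txt` file per article. The abstract does not specify the layout, and the function names `load_single_label_corpus` and `label_distribution` are illustrative, not part of any SANAD tooling.

```python
import sys
from collections import Counter
from pathlib import Path


def load_single_label_corpus(root):
    """Return (texts, labels) for a corpus stored as root/<category>/*.txt.

    ASSUMPTION: one subdirectory per category and one UTF-8 text file per
    article. Adjust the glob pattern if the real layout differs.
    """
    texts, labels = [], []
    root = Path(root)
    # Sort directories and files so the result is deterministic.
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for article in sorted(category_dir.glob("*.txt")):
            texts.append(article.read_text(encoding="utf-8"))
            # Single-label setting: the folder name is the article's only label.
            labels.append(category_dir.name)
    return texts, labels


def label_distribution(labels):
    """Count how many articles carry each category label."""
    return Counter(labels)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python load_corpus.py /path/to/one/portal/directory
    texts, labels = load_single_label_corpus(sys.argv[1])
    for category, count in sorted(label_distribution(labels).items()):
        print(f"{category}\t{count}")
```

Checking the per-category counts first is a cheap way to see whether the categories are balanced within a portal before choosing an evaluation metric.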
