Classifying the Arabic web — A pilot study

The World Wide Web has become the favorite destination of information seekers across the globe. With billions of web pages, information on just about any topic is only a click away. Analyzing this massive body of content matters for several reasons, including information discovery, more efficient search engines, and the study of social and political patterns. Web mining techniques such as text classification and categorization are being used to provide an “under-the-microscope” picture of the web. The Arabic web represents an important portion of the web. With Arabic the fifth most spoken language in the world and the number of Arabic Internet users increasing at exponential rates, it is becoming important to analyze Arabic web content and study its trends. This paper takes a close look at the content of the Arabic web. It reports the percentage of that content falling into each of five categories: politics, culture, sports, economics, and religion. We used two different text classification algorithms and compared their results, including their precision and recall. The classifiers showed that economics and politics account for the largest shares (65% combined), while culture and religion account for the smallest (about 10% combined).
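The two quantities the abstract reports, per-category shares and per-category precision and recall, can be illustrated with a minimal sketch. The function names and the toy labels below are hypothetical and chosen only for illustration; they are not the paper's classifiers, data, or results.

```python
from collections import Counter

# The five categories named in the abstract.
CATEGORIES = ["politics", "culture", "sports", "economics", "religion"]


def category_shares(predicted):
    """Return the percentage of pages assigned to each category."""
    total = len(predicted)
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    counts = Counter(predicted)
    return {c: 100.0 * counts[c] / total for c in CATEGORIES}


def precision_recall(true_labels, predicted, category):
    """Precision and recall for one category, treated as the positive class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy, invented labels (assumption): six pages with known and predicted categories.
true_labels = ["politics", "economics", "sports", "politics", "religion", "economics"]
predicted = ["politics", "economics", "sports", "economics", "religion", "economics"]

shares = category_shares(predicted)
econ_precision, econ_recall = precision_recall(true_labels, predicted, "economics")
```

Running the same two functions on the output of each classifier would give a like-for-like comparison of the kind the abstract describes, assuming a labeled sample of pages is available for the precision and recall step.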
