Enhanced Web Objects Classification using Social Tags

The automatic classification of web objects into semantic categories is important because it facilitates indexing, browsing, searching, and mining these objects. The task is challenging, however, because web objects often lack easily extractable features that carry semantic information, interconnections with one another, and training examples with category labels. Social tags reflect the semantics of web objects from the users' points of view, which makes them a promising feature for overcoming these difficulties. In this paper we study the impact of social tagging on the performance of text classification techniques applied to web object classification. We have developed an automated system for web object classification that is based on exploring social tags. The system has three phases: data preprocessing, classification, and evaluation. It accepts a training dataset that represents a set of web pages with their URLs, tags, titles, and categories. From this dataset, the system constructs a predictive model that is later used to assign labels to web objects based on their tags. In the classification phase, the system employs three well-known text classification techniques, namely Support Vector Machine, Naïve Bayes, and Decision Tree, through the WEKA software. Experiments have been conducted to evaluate the effectiveness of using social tags with each of the three techniques. The experimental results indicate that using tags significantly improves classification performance.

Keywords: web objects classification; social tagging; text classification methods; WEKA software; cross validation.
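The pipeline described above (train on tagged pages, predict a category from tags, then evaluate with cross validation) can be illustrated with a minimal sketch. This is not the system from the paper, which runs its classifiers through WEKA; it is a small standard-library stand-in that uses multinomial Naïve Bayes with add-one smoothing. The function names, the interleaved fold assignment, and the toy dataset are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy dataset: (social tags of a page, category label).
TOY_DATA = [
    (["python", "programming", "code"], "computers"),
    (["recipe", "cooking", "food"], "food"),
    (["java", "code", "software"], "computers"),
    (["baking", "food", "dessert"], "food"),
    (["linux", "software", "programming"], "computers"),
    (["dinner", "recipe", "cooking"], "food"),
]


def train_nb(samples):
    """Count class and per-class tag frequencies from (tags, category) pairs."""
    class_counts = Counter()
    tag_counts = defaultdict(Counter)
    vocab = set()
    for tags, category in samples:
        class_counts[category] += 1
        tag_counts[category].update(tags)
        vocab.update(tags)
    return class_counts, tag_counts, vocab


def predict_nb(model, tags):
    """Return the category with the highest log-posterior for the given tags."""
    class_counts, tag_counts, vocab = model
    total = sum(class_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(class_counts):
        n_tags = sum(tag_counts[category].values())
        score = math.log(class_counts[category] / total)
        for tag in tags:
            if tag not in vocab:
                continue  # ignore tags never seen during training
            # Add-one (Laplace) smoothing over the training vocabulary.
            count = tag_counts[category][tag]
            score += math.log((count + 1) / (n_tags + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def cross_validate(samples, k):
    """Mean accuracy over k folds; fold i holds out samples i, i+k, i+2k, ..."""
    accuracies = []
    for fold in range(k):
        test = samples[fold::k]
        train = [s for i, s in enumerate(samples) if i % k != fold]
        model = train_nb(train)
        correct = sum(predict_nb(model, tags) == cat for tags, cat in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


if __name__ == "__main__":
    model = train_nb(TOY_DATA)
    print(predict_nb(model, ["code", "software"]))
    print(cross_validate(TOY_DATA, k=3))
```

Treating each page as a bag of tags keeps the example close to ordinary text classification, which is the setting the abstract describes. A real experiment would need a much larger dataset, shuffled or stratified folds, and the other two classifiers for comparison.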
