Classification of web pages on attractiveness: A supervised learning approach

Random surfers spend very little time on a web page. If the page's most important content fails to attract a surfer's attention within that short time span, the surfer will move on to another page, defeating the purpose of the web page designer. To predict whether the contents of a web page will catch a random surfer's attention, we propose a machine learning based approach that classifies web pages into “bad” and “not bad” classes, where the “bad” class implies poor attention-drawing ability. To build the classifier, we divide web page contents into “objects”, which are coherent regions of a web page conveying the same information. We surveyed 100 web pages sampled from the Internet to identify the types and frequencies of objects used in web page design. From the survey, we identified six types of objects that are most important in determining the class of a web page in terms of its attention-drawing capability. We used the WEKA tool to implement the machine learning approach. Two percentage-split strategies and three cross-validation strategies were used to check the accuracy of the classifier. We experimented with 65 algorithms supported by WEKA and found that two of them, RBF network and Random subspace, give the best performance, with about 83% accuracy.
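
The two evaluation styles named in the abstract, percentage split and cross validation, can be illustrated with a minimal sketch. This is not the authors' WEKA setup: the 66% split, the 10 folds, the synthetic six-feature pages, and the nearest-centroid classifier are all assumptions chosen only to keep the example self-contained, and every function name is hypothetical.

```python
import random

def percentage_split(rows, train_fraction, seed=0):
    """Shuffle rows, then split them into (train, test) by train_fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_folds(rows, k, seed=0):
    """Yield (train, test) pairs; each row is in exactly one test fold."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

def train_centroids(train):
    """Stand-in classifier (assumption): mean feature vector per class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest in squared distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))

def accuracy(train, test):
    """Fraction of test rows whose predicted label matches the true label."""
    centroids = train_centroids(train)
    hits = sum(predict(centroids, f) == y for f, y in test)
    return hits / len(test)

def make_toy_pages(n=100, seed=1):
    """Synthetic data (assumption): six object counts per page plus a label."""
    rng = random.Random(seed)
    pages = []
    for _ in range(n):
        label = rng.choice(["bad", "not bad"])
        base = 2 if label == "bad" else 6
        features = [base + rng.randint(-2, 2) for _ in range(6)]
        pages.append((features, label))
    return pages

if __name__ == "__main__":
    pages = make_toy_pages()
    train, test = percentage_split(pages, 0.66)
    print("66% split accuracy:", accuracy(train, test))
    scores = [accuracy(tr, te) for tr, te in k_folds(pages, 10)]
    print("10-fold mean accuracy:", sum(scores) / len(scores))
```

A percentage split gives a single accuracy estimate from one held-out portion, whereas cross validation averages over several held-out folds, which is presumably why both are reported. The accuracies this toy data produces say nothing about the 83% figure above.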
