A Novel Approach for Web Page Classification using Optimum features

The rapid, exponential growth of the Web is well known. The amount of textual data available on the Web is estimated to be on the order of one terabyte, in addition to images, audio, and video. This growth poses additional challenges for Web directories, which help users search the Web by classifying selected Web documents into subject categories. Manual classification of Web pages by human experts likewise cannot keep pace with the exponential increase in the number of Web documents. This article emphasizes the need for automatic Web page classification that uses a minimum number of features from a page rather than the entire page. A method for generating such an optimum set of features for Web pages is also proposed. Machine learning classifiers are then modeled using these optimum features. Experiments with these classifiers on benchmark data sets show a promising improvement in classification accuracy.
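The abstract does not say how the optimum features are chosen or which classifiers are used, so the following is only an illustrative sketch of the general idea: keep a small subset of terms and train a classifier on that subset alone. The chi-square ranking, the multinomial naive Bayes classifier, the toy pages, and all function names here are assumptions made for illustration and are not taken from the article.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def chi_square_scores(docs, labels):
    """Score each term by its chi-square statistic, maximised over classes."""
    n = len(docs)
    doc_terms = [set(tokenize(d)) for d in docs]
    vocab = set().union(*doc_terms)
    class_sizes = Counter(labels)
    scores = {}
    for term in vocab:
        best = 0.0
        for cls, n_cls in class_sizes.items():
            # 2x2 contingency table: term present/absent vs. in/out of class
            in_with = sum(1 for s, y in zip(doc_terms, labels) if term in s and y == cls)
            out_with = sum(1 for s, y in zip(doc_terms, labels) if term in s and y != cls)
            in_without = n_cls - in_with
            out_without = n - n_cls - out_with
            denom = ((in_with + out_with) * (in_without + out_without)
                     * (in_with + in_without) * (out_with + out_without))
            if denom:
                diff = in_with * out_without - out_with * in_without
                best = max(best, n * diff ** 2 / denom)
        scores[term] = best
    return scores


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def train_naive_bayes(docs, labels, features):
    """Count class frequencies and per-class counts of the selected terms."""
    class_docs = Counter(labels)
    term_counts = {cls: Counter() for cls in class_docs}
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in tokenize(doc) if t in features)
    return class_docs, term_counts


def classify(text, model, features):
    """Pick the class with the highest Laplace-smoothed log-probability."""
    class_docs, term_counts = model
    total_docs = sum(class_docs.values())
    tokens = [t for t in tokenize(text) if t in features]
    best_label, best_logp = None, -math.inf
    for cls in sorted(class_docs):
        logp = math.log(class_docs[cls] / total_docs)
        denom = sum(term_counts[cls].values()) + len(features)
        for t in tokens:
            logp += math.log((term_counts[cls][t] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = cls, logp
    return best_label


# Hypothetical toy "pages"; a real experiment would use a benchmark data set.
docs = [
    "football match score goal team",
    "team wins league football cup",
    "stock market shares price rise",
    "bank shares market profit price",
]
labels = ["sports", "sports", "finance", "finance"]

features = select_features(docs, labels, k=5)
model = train_naive_bayes(docs, labels, features)
print(sorted(features))
print(classify("the football team scored a late goal", model, features))
```

In this sketch the classifier only ever sees the selected terms, so the size of the model is governed by `k` rather than by the full vocabulary of the pages. How many features count as "optimum", and how they are generated, is the article's contribution and is not reproduced here.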
