A novel approach for effective web page classification

As the volume of the WWW grows exponentially every day, web page classification has become increasingly tedious. Because quality mining results depend on quality data, it is worth emphasising the fine tuning of data for classification rather than improving the classifiers themselves. This paper investigates methods for improving web page classification through feature extraction, feature selection and data tuning. It also proposes a new model for web page classification, the probabilistic web page classifier (PWPC), which is based on a probabilistic framework and an attribute-value similarity (AVS) measure. The proposed method is tested on a benchmark dataset, WebKB, and PWPC on the fine-tuned web pages shows a significant accuracy improvement over traditional machine learning classifiers.
