A novel feature selection framework for automatic web page classification

The number of Internet users and the number of pages added to the World Wide Web grow rapidly every day. Web pages therefore need to be classified into web directories automatically and efficiently, which helps search engines return relevant results quickly. Because a web page is represented by thousands of features, feature selection helps web page classifiers cope with this high dimensionality. This paper proposes a new feature selection method based on Ward's minimum variance measure. The measure is first used to identify clusters of redundant features in a web page. In each cluster, the best representative features are retained and the others are eliminated. Removing such redundant features reduces resource utilization during classification. The proposed method is compared with other common feature selection methods. Experiments on the WebKB benchmark data set show that the proposed method performs better than most of the other feature selection methods in terms of reducing the number of features and the classifier modeling time.
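
The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each feature is a column of per-document values, that two clusters of features are merged when the Ward cost |A||B| / (|A| + |B|) * ||c_A - c_B||^2 is smallest, that merging stops at a hypothetical `threshold`, and that the "best representative" of a cluster is the feature closest to the cluster centroid. The abstract does not specify the stopping rule or the representative criterion, so both are assumptions, as are all function and parameter names.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge.

    Each cluster is given as a (size, centroid) pair.
    """
    (na, ca), (nb, cb) = a, b
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq_dist


def cluster_features(columns, threshold):
    """Agglomerative clustering of feature columns using Ward's criterion.

    columns: dict mapping feature name -> list of values, one per document.
    threshold: assumed cutoff; merging stops once the cheapest merge
    would cost more than this value.
    Returns a list of (member_names, size, centroid) tuples.
    """
    clusters = [([name], 1, list(vals)) for name, vals in columns.items()]
    while len(clusters) > 1:
        # Find the pair of clusters whose merge is cheapest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: ward_cost(clusters[p[0]][1:], clusters[p[1]][1:]),
        )
        if ward_cost(clusters[i][1:], clusters[j][1:]) > threshold:
            break
        (mi, ni, ci), (mj, nj, cj) = clusters[i], clusters[j]
        n = ni + nj
        centroid = [(ni * x + nj * y) / n for x, y in zip(ci, cj)]
        # combinations() yields i < j, so pop j first to keep i valid.
        clusters.pop(j)
        clusters.pop(i)
        clusters.append((mi + mj, n, centroid))
    return clusters


def select_representatives(columns, threshold):
    """Keep one feature per cluster: the one closest to the centroid (assumed rule)."""
    selected = []
    for members, _, centroid in cluster_features(columns, threshold):
        best = min(
            members,
            key=lambda m: sum((x - c) ** 2 for x, c in zip(columns[m], centroid)),
        )
        selected.append(best)
    return selected
```

For example, given term counts for the hypothetical features `course`, `courses` and `faculty` over four pages, near-identical columns such as `course` and `courses` fall into one cluster and only one of them is kept, while `faculty` stays in a cluster of its own.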
