A Web page classification system based on a genetic algorithm using tagged-terms as features

The rapid growth in the amount of information on the World Wide Web has given rise to topic-specific crawling of the Web. During a focused crawl, an automatic Web page classification mechanism is needed to determine whether the page under consideration is on topic. In this study, we develop a genetic algorithm (GA) based automatic Web page classification system that uses both HTML tags and the terms belonging to each tag as classification features, and learns an optimal classifier from the positive and negative Web pages in the training dataset. Our system classifies a new Web page simply by computing the similarity between the learned classifier and that page. Existing GA-based classifiers use either HTML tags or terms alone as features; in this study both are used together, and our GA learns optimal weights for the features. We found that using both HTML tags and the terms in each tag as separate features improves classification accuracy. The composition of the training dataset also affects accuracy: when the number of negative documents is larger than the number of positive documents, the classification accuracy of our system increases up to 95% and exceeds that of the well-known Naive Bayes and k-nearest neighbor classifiers.
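The idea of tagged-term features and similarity-based classification can be illustrated with a minimal sketch. It is not the system described here: the tracked tag set, the tokenizer, the use of cosine similarity, the threshold, and the hand-set weights are all assumptions made for illustration. In the actual system the weights would be learned by the GA from positive and negative training pages.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a small illustrative tag set; the abstract does not list the tags used.
TRACKED_TAGS = {"title", "h1", "a", "p"}


class TaggedTermParser(HTMLParser):
    """Counts (tag, term) pairs, so the same term under two tags is two features."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tracked tags
        self.counts = Counter()  # (tag, term) -> frequency

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.stack:
            return  # text outside tracked tags is ignored in this sketch
        tag = self.stack[-1]
        for term in re.findall(r"[a-z]+", data.lower()):
            self.counts[(tag, term)] += 1


def tagged_term_counts(html):
    parser = TaggedTermParser()
    parser.feed(html)
    parser.close()
    return parser.counts


def cosine_similarity(weights, counts):
    """Cosine similarity between a classifier weight vector and a page vector."""
    dot = sum(w * counts.get(feature, 0) for feature, w in weights.items())
    norm_w = math.sqrt(sum(w * w for w in weights.values()))
    norm_c = math.sqrt(sum(c * c for c in counts.values()))
    if norm_w == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_w * norm_c)


def is_on_topic(weights, html, threshold=0.3):
    # Assumption: a fixed threshold; the abstract does not state the decision rule.
    return cosine_similarity(weights, tagged_term_counts(html)) >= threshold


# Hypothetical hand-set weights standing in for GA-learned ones.
classifier = {("title", "genetic"): 0.9, ("p", "genetic"): 0.4, ("h1", "course"): 0.2}

on_topic_page = (
    "<html><head><title>Genetic algorithms</title></head>"
    "<body><h1>Course</h1><p>genetic search methods</p></body></html>"
)
off_topic_page = (
    "<html><head><title>Cooking tips</title></head>"
    "<body><p>bake the bread slowly</p></body></html>"
)

print(is_on_topic(classifier, on_topic_page))
print(is_on_topic(classifier, off_topic_page))
```

Keeping the tag in the feature key is what lets a term in the title carry a different weight from the same term in body text, which is the distinction the abstract draws against classifiers that use tags or terms alone.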
