Classifying Web Pages using Support Vector Machine

Introduction and Background

With the rapid development of the World Wide Web, the amount of information in web pages is growing explosively. Because of this volume, information overload and the difficulty of finding relevant information are problems for every web search engine. In addition, web pages are not consistently structured. Hence web page classification is a vital task in every web search engine. Web page categorization is significant in numerous information retrieval tasks, such as the retrieval of scientific papers, e-books and digital libraries from the web. In web usage mining, web page classification is used to build customized web services for individual web users. Web structure mining is concerned with discovering the model underlying the link structure of web pages, for example to uncover the links between terrorists in social networks. In web page filters such as e-mail filters and content filters, web content filtering determines the content that is to be blocked in a web page. Thus web page categorization supports effective web information retrieval, web content filtering, web structure mining and web usage mining.

A variety of rule-based and machine learning techniques are currently in use for web page categorization. In [1], various supervised learning techniques, namely decision tree, k-nearest neighbor, OneR, multilayer perceptron and RBF kernel, are adopted for web page categorization. In [2], web page categorization is implemented using three feature selection techniques (filter model, wrapper model and hybrid model) together with the PageRank algorithm in order to reduce the redundant features in the web page. In [3], the authors use features extracted from the HTML source code and the URL, a compound of HTML and URL features, and features of sibling pages for web page categorization.
In [4], the Naive Bayes algorithm is used as a classifier and compared with semi-supervised algorithms such as co-training and expectation maximization, and inductive logic programming is applied to improve the performance of the weak learner for web page categorization.

The research work presented in this paper combines the web page features described in [1] [2] [3] [4] and identifies a few new features that can contribute further to the classification of web pages. The additional features are taken from the HTML source code of the web pages: the strings between slashes and dots in the href attributes of all anchor tags, and the strings between underscores and minus symbols in the href attributes of all anchor tags. These features are used to incorporate the reference mechanisms available in web pages, such as tables, footnotes and bibliographies. They also capture the interconnection between linked web pages, which consist of text, images, video and other multimedia content. Apart from the URL features used in existing work, the proposed web page categorization model also employs new URL features, such as the substrings between underscores and minus symbols of URLs and the substrings between two different symbols of URLs. These features provide additional information from the URL. Hence these features are essential and contribute further to web page categorization.

In most of the existing work, web page categorization was carried out to classify web pages of a similar domain. Here, web pages of different domains such as arts, business, culture, education, entertainment, and health and wellness are considered for categorization. This paper describes the implementation of a support vector machine for classifying the web pages of six different domains. The features are extracted from the HTML structures and URLs of a set of web pages in different categories.
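The href and URL features described above can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes that "strings between" two symbols means the substrings enclosed on both sides by those symbols, and the names `AnchorHrefCollector`, `enclosed_tokens`, `extract_features` and the feature keys are hypothetical, chosen only for this illustration.

```python
import re
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Collects the href attribute of every anchor (<a>) tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def enclosed_tokens(text, symbols, require_different=False):
    """Return substrings of `text` enclosed on both sides by one of `symbols`.

    Assumption: a token counts only if a symbol appears directly before and
    directly after it. With require_different=True the two enclosing symbols
    must differ (e.g. '/' on the left and '.' on the right).
    """
    tokens = []
    run = "[^" + re.escape(symbols) + "]+"  # maximal run of non-symbol chars
    for match in re.finditer(run, text):
        start, end = match.span()
        if start == 0 or end == len(text):
            continue  # touches an end of the string, so not enclosed
        if require_different and text[start - 1] == text[end]:
            continue
        tokens.append(match.group().lower())
    return tokens


def extract_features(html, url):
    """Build the href-based and URL-based token features for one page."""
    collector = AnchorHrefCollector()
    collector.feed(html)
    features = {
        "href_slash_dot": [],
        "href_underscore_minus": [],
        "url_underscore_minus": enclosed_tokens(url, "_-"),
        "url_two_symbols": enclosed_tokens(url, "/._-", require_different=True),
    }
    for href in collector.hrefs:
        features["href_slash_dot"].extend(enclosed_tokens(href, "/."))
        features["href_underscore_minus"].extend(enclosed_tokens(href, "_-"))
    return features


if __name__ == "__main__":
    # Hypothetical page and URL, used only to show the call.
    page = '<a href="http://example.org/arts/modern_art-gallery.html">Gallery</a>'
    print(extract_features(page, "http://example.org/health_and-wellness_tips/index.html"))
```

The resulting token lists would still need to be turned into numeric vectors (for example, term counts) before they could be passed to a support vector machine; that step is left out here because the source does not specify how it is done.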
Feature extraction and the experiments carried out are described in the rest of this paper.

Proposed Web Page Classification Model

The proposed web page categorization model reduces the complexity of web mining. Web pages of the different categories are collected at random from search engines. The acquired web pages are preprocessed, and features are extracted from the HTML structure and the URL using feature extraction methods. The training data set with instances associated with six domains such as arts, business, culture, education,

[1]  Yuhui Qiu, et al. Study of Web Information Extraction and Classification Method, 2007, 2007 International Conference on Wireless Communications, Networking and Mobile Computing.

[2]  Sl Ting, et al. Is Naïve Bayes a Good Classifier for Document Classification, 2011.

[3]  Sini Shibu, et al. A Combination Approach for Web Page Classification using Page Rank and Feature Selection Technique, 2010, International Journal of Computer Theory and Engineering.

[4]  Choochart Haruechaiyasak, et al. Hierarchical Web Page Classification Based on a Topic Model and Neighboring Pages Integration, 2010, ArXiv.

[5]  Lilac A. E. Al-Safadi, et al. Auto Classification for Search Intelligence, 2009.

[6]  K Ramya, et al. Analysis of Users' Web Navigation Behavior using GRPA with Variable Length Markov Chains, 2011.

[7]  J. Alamelu Mangai, et al. A Novel Approach for Web Page Classification using Optimum Features, 2011.

[8]  Michael A. Arbib, et al. The Handbook of Brain Theory and Neural Networks, 1995, A Bradford book.

[9]  G. N. Purohit, et al. Page Ranking Algorithms for Web Mining, 2011.

[10]  Amir Masoud Rahmani, et al. Webpage Classification based on Compound of Using HTML Features & URL Features and Features of Sibling Pages, 2010, Int. J. Adv. Comp. Techn.

[11]  Y. K. Jain, et al. Classification-based Retrieval Methods to Enhance Information Discovery on the Web, 2011, ICCA 2010.

[12]  Hung Hum, et al. Is Naïve Bayes a Good Classifier for Document Classification, 2011.

[13]  R. Kass, et al. Multiple Neural Spike Train Data Analysis: State-of-the-Art and Future Challenges, 2004, Nature Neuroscience.

[14]  Nuanwan Soonthornphisaj, et al. Combining ILP with Semi-supervised Learning for Web Page Categorization, 2004, International Conference on Computational Intelligence.

[15]  Wen Li, et al. A Naive Bayesian Multi-label Classification Algorithm With Application to Visualize Text Search Results, 2011.

[16]  Rich Caruana, et al. An Empirical Comparison of Supervised Learning Algorithms, 2006, ICML.

[17]  K. Selvakuberan, et al. Combined Feature Selection and Classification – A Novel Approach for the Categorization of Web Pages, 2008.