Automated Classification of Web Sites using Naive Bayesian Algorithm

Abstract: Subject-based web directories such as the Open Directory Project's (ODP) Directory Mozilla (DMOZ) and Yahoo consist of web pages classified into various categories. This careful classification has made these directories popular among web users. However, the exponential growth of the web has made human-edited, subject-based directories difficult to maintain, and the World Wide Web (WWW) lacks a comprehensive web site directory. Web site classification using machine learning techniques is therefore an emerging way to maintain directory services for the web automatically. The home page of a web site is a distinguished page that acts as an entry point by providing links to the rest of the site. The information contained in its title, meta keywords, meta description, and the labels of its anchor (A HREF) tags, along with its other content, is a rich source of the features required for classification. Compared with the other pages of a web site, webmasters take more care in designing the home page and its content, both to give it an aesthetic look and to summarize precisely the organization to which the site belongs. This expressive power of the home page can be exploited to identify the nature of the organization. In this paper we attempt to classify web sites based on the content of their home pages using the Naive Bayesian machine learning algorithm.
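
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the event model, smoothing, or preprocessing, so the sketch assumes a multinomial Naive Bayes model with Laplace (add-one) smoothing and a simple lowercase word tokenizer. It also uses only the title, the meta keywords and description, and the anchor labels, leaving out the other home page content. The names `HomePageFeatures`, `tokenize`, `extract`, and `NaiveBayes`, as well as the two toy home pages and their category labels, are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


class HomePageFeatures(HTMLParser):
    """Collect words from <title>, meta keywords/description, and <a> labels."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._capture = 0  # number of currently open <title>/<a> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "a"):
            self._capture += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("keywords", "description"):
            self.words += tokenize(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag in ("title", "a") and self._capture:
            self._capture -= 1

    def handle_data(self, data):
        if self._capture:  # text outside <title>/<a> is ignored in this sketch
            self.words += tokenize(data)


def extract(html):
    """Return the list of feature words found on one home page."""
    parser = HomePageFeatures()
    parser.feed(html)
    parser.close()
    return parser.words


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy training data: two made-up home pages with made-up category labels.
page_edu = ('<html><head><title>Springfield University</title>'
            '<meta name="keywords" content="university, courses, admissions, faculty">'
            '</head><body><a href="/a">Admissions</a><a href="/c">Courses</a>'
            '<p>ignored body text</p></body></html>')
page_shop = ('<html><head><title>Acme Online Store</title>'
             '<meta name="description" content="Buy shoes and bags online, free shipping">'
             '</head><body><a href="/cart">Cart</a><a href="/sale">Sale</a></body></html>')

model = NaiveBayes().fit([extract(page_edu), extract(page_shop)],
                         ["education", "shopping"])
new_page = '<title>City College</title><a href="/x">Faculty</a><a href="/y">Admissions</a>'
print(model.predict(extract(new_page)))
```

Scores are accumulated as log probabilities so that multiplying many small word probabilities does not underflow. A realistic setup would need many labeled home pages per category and would likely add stop-word removal and stemming, neither of which is shown here.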
