Analysis and Evaluation of Web Pages Classification Techniques for Inappropriate Content Blocking

This paper addresses the automated categorization of web sites for systems that block web pages containing inappropriate content. We apply Machine Learning and Data Mining methods to the analysis of page text, HTML tags, URL addresses, and other information. We also propose techniques for analyzing sites that provide information in different languages. We present the architecture and algorithms of a system that collects, stores, and analyzes the data required for site classification. We report experimental results on how well web sites correspond to different categories, and we evaluate the classification quality. The classification system developed in this work is deployed in F-Secure mass production systems that analyze web content.
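
To make the idea of combining text, HTML tag, and URL signals concrete, the following is a minimal illustrative sketch, not the system described in the paper. The abstract does not name the algorithms used, so the multinomial naive Bayes classifier, the crude regex-based HTML handling, the category labels, the example URLs, and all function and class names here are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z]+")


def extract_features(url, html):
    """Return tokens prefixed by their source: URL, page text, or <title> tag."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping (assumption)
    title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    tokens = ["url:" + t for t in TOKEN_RE.findall(url.lower())]
    # Page text includes the title words as well, since only tags are removed.
    tokens += ["text:" + t for t in TOKEN_RE.findall(text.lower())]
    if title:
        tokens += ["title:" + t for t in TOKEN_RE.findall(title.group(1).lower())]
    return tokens


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()               # documents seen per category
        self.token_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()

    def fit(self, samples):
        for tokens, label in samples:
            self.class_docs[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set: (URL, HTML, category).
TRAINING = [
    ("http://casino-bets.example/poker",
     "<html><title>Poker bets</title><body>Play poker and win jackpot</body></html>",
     "gambling"),
    ("http://news.example/weather",
     "<html><title>Weather news</title><body>Rain forecast for today</body></html>",
     "news"),
]

model = NaiveBayes()
model.fit([(extract_features(u, h), label) for u, h, label in TRAINING])

page = extract_features("http://example.org/jackpot-poker",
                        "<title>Win</title><body>poker jackpot</body>")
print(model.predict(page))
```

Prefixing each token with its source keeps a word in the URL distinct from the same word in the body text, so the classifier can weight the two signals separately. A production system would need a real HTML parser, language-aware tokenization for multilingual sites, and far more training data.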
