Categorisation of web pages for protection against inappropriate content in the internet

The paper outlines a framework for automated categorisation of web pages to protect against inappropriate content. It contains an overview of the framework, an analysis of the state of the art, a description of the developed prototype, and its evaluation based on a series of experiments. Several sources are used for the categorisation, namely text, HTML tags and URL addresses. During categorisation, this data and other information are analysed using machine learning and data mining methods. Finally, the quality of the categorisation is evaluated. The categorisation system developed as a result of this work is planned to be partially implemented at F-Secure Corporation in mass production systems that analyse web content.
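As a rough illustration of how the three sources named above (text, HTML tags and URL addresses) might be turned into input for a classifier, the following minimal sketch builds one bag-of-features per page. It is not the prototype described in the paper: the function name `extract_features`, the `text:`/`tag:`/`url:` prefixes and the simple lowercase tokeniser are all assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class _PageParser(HTMLParser):
    """Collects visible text and counts opening tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that a reader would see on the page.
        if self._skip_depth == 0:
            self.text_parts.append(data)


def tokenize(s):
    """Assumed tokeniser: lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", s.lower())


def extract_features(url, html):
    """Return a Counter of prefixed features from text, HTML tags and URL."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()

    features = Counter()
    for tok in tokenize(" ".join(parser.text_parts)):
        features["text:" + tok] += 1
    for tag, n in parser.tag_counts.items():
        features["tag:" + tag] += n
    parsed = urlparse(url)
    for tok in tokenize(parsed.netloc + " " + parsed.path):
        features["url:" + tok] += 1
    return features
```

A feature dictionary of this kind could then be passed to any standard text classifier; keeping the three sources under separate prefixes lets a model weight the same token differently depending on whether it appeared in the page text or in the URL.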
