Semantic Based Highly Accurate Autonomous Decentralized URL Classification System for Web Filtering

Cyberspace currently contains about one billion registered websites, so accurately categorizing this large volume of websites and URLs is essential for URL filtering and marketing segmentation. This paper presents an autonomous decentralized, semantic-based, large-scale URL/web classification system for web filtering that uses the Yago2s and DS-onto knowledge bases. Because many of the predefined categories overlap heavily or are semantically similar, the proposed word sense disambiguation algorithm, together with the inference engine design, is used to classify URLs with high accuracy into 120 different categories. Evaluation results show that the system achieves 90-93% accuracy, which is much higher than that of currently used URL classification systems.
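To illustrate why word sense disambiguation matters when categories overlap, the following is a minimal sketch of a gloss-overlap classifier in the spirit of the simplified Lesk approach. It is not the algorithm proposed in this paper: the two categories, their gloss words, and the function names `tokenize` and `classify` are all hypothetical, and the knowledge base is replaced by a hand-written dictionary.

```python
import re

# Hypothetical category glosses (assumption for illustration only).
# The paper's 120 categories and its Yago2s / DS-onto knowledge bases
# are not reproduced here.
CATEGORY_GLOSSES = {
    "finance": "bank loan interest account money deposit credit",
    "geography": "river bank water shore stream valley",
}


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(page_text, glosses=CATEGORY_GLOSSES):
    """Pick the category whose gloss shares the most words with the page.

    The ambiguous word "bank" appears in both glosses, so the surrounding
    context words decide which sense (and category) wins.
    """
    context = tokenize(page_text)
    scores = {
        category: len(context & tokenize(gloss))
        for category, gloss in glosses.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    print(classify("Open a savings account at our bank and earn interest on every deposit"))
    print(classify("The river bank flooded after the stream rose"))
```

In this sketch, the word "bank" alone cannot separate the two categories; the overlap with the remaining context words does. A real system would presumably draw its sense definitions from a knowledge base rather than from fixed strings.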
