Web objectionable text content detection using topic modeling technique

Web 2.0 technologies have made it easy for Web users to create and spread objectionable text content, which has been shown to be harmful to Web users, especially young children. Although detection methods based on keyword lists offer faster detection and lower memory consumption, they fail to detect text content that is objectionable in its semantic description. We propose a framework that integrates a semantic model with a detection method and performs probabilistic inference to detect this kind of Web text content. Based on the observation that an objectionable scene can be described by a set of sentences, a topic model learnt from that set serves as the semantic model of the scene. For a given sentence, the framework calculates a probability value that indicates the likelihood of the sentence with respect to the model. A mapping function then transforms this probability value into a new indicator that is convenient for making the final decision. Extensive comparison experiments on two real-world text sets show that the framework can effectively recognize semantically objectionable text, and that it outperforms traditional methods in both detection rate and false alarm rate.
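The scoring pipeline described in the abstract (learn a model from scene sentences, compute a sentence likelihood, map it to a decision indicator) can be illustrated with a minimal sketch. The abstract does not specify the topic model or the mapping function, so everything here is an assumption: an add-alpha smoothed unigram model stands in for the topic model, a logistic function stands in for the mapping, and the names `train_unigram`, `avg_log_likelihood`, `to_indicator`, `midpoint`, and `steepness` are hypothetical. The placeholder sentences are neutral and serve only to show the mechanics.

```python
import math
from collections import Counter


def train_unigram(sentences, alpha=1.0):
    """Fit an add-alpha smoothed unigram model over the scene sentences.

    Assumption: this simple model is only a stand-in for the topic model
    described in the abstract, which is not specified there.
    """
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen words

    def log_prob(word):
        return math.log((counts.get(word, 0) + alpha) / (total + alpha * vocab_size))

    return log_prob


def avg_log_likelihood(sentence, log_prob):
    """Per-word average log-likelihood of a sentence under the model."""
    words = sentence.lower().split()
    if not words:
        return float("-inf")
    return sum(log_prob(w) for w in words) / len(words)


def to_indicator(avg_ll, midpoint, steepness=2.0):
    """Hypothetical mapping function: logistic squashing into [0, 1).

    `midpoint` is the average log-likelihood that maps to 0.5; it would
    have to be tuned on held-out data in any real use.
    """
    if avg_ll == float("-inf"):
        return 0.0
    return 1.0 / (1.0 + math.exp(-steepness * (avg_ll - midpoint)))


# Neutral placeholder "scene" sentences, for illustration only.
scene = ["the quick fox jumps", "the fox runs quick"]
model = train_unigram(scene)

in_scene = to_indicator(avg_log_likelihood("the fox jumps", model), midpoint=-2.0)
off_scene = to_indicator(avg_log_likelihood("stock prices fell", model), midpoint=-2.0)

# A sentence is flagged when its indicator exceeds a chosen threshold.
THRESHOLD = 0.5
print(in_scene > THRESHOLD, off_scene > THRESHOLD)
```

Averaging the log-likelihood per word keeps long and short sentences comparable, and squashing the score into a fixed range lets a single threshold drive the final decision. Both are design choices of this sketch, not details confirmed by the abstract.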
