Adult Content Filtering through Compression-Based Text Classification

The Internet is a powerful source of information. However, some of the information available on the Internet is not suitable for every audience. For instance, pornography should not be shown to children. To this end, several text filtering algorithms have been proposed that employ a Vector Space Model (VSM) representation of webpages. Nevertheless, these types of filters can be bypassed using different attacks. In this paper, we present the first adult content filtering tool that employs compression algorithms to represent the data and that is resilient to these attacks. We show that this approach improves on the results of classic VSM models.
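As a rough illustration of the general idea behind compression-based text classification, and not of the tool presented in this paper, the sketch below assigns a document to the class whose training text helps compress it best. It assumes Python's standard zlib compressor as a stand-in for whichever compression models the paper uses. The toy corpora, the labels and the function names are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs to encode the text at maximum compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, corpora: dict) -> str:
    """Return the label whose corpus makes the document cheapest to encode.

    The cost of a document under a class is the growth in compressed size
    when the document is appended to that class's training text. Text that
    resembles the corpus reuses patterns already seen, so it adds few bytes.
    """
    def extra_bytes(label: str) -> int:
        corpus = corpora[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(corpora, key=extra_bytes)


# Hypothetical toy training text for each class (assumption, not paper data).
corpora = {
    "adult": "explicit adult material xxx videos uncensored adult pictures "
             "explicit content for adults only",
    "safe": "school homework mathematics lessons for children science "
            "projects and library reading lists",
}

label_a = classify("uncensored adult pictures explicit content", corpora)
label_b = classify("mathematics lessons for children", corpora)
print(label_a, label_b)
```

Because the score comes from repeated character sequences rather than from a fixed vocabulary of tokens, this kind of representation does not depend on splitting the text into words, which is the step that attacks on word-based VSM filters typically target.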
