Statistical and structural approaches to filtering Internet pornography

The WWW is a major source of unintentional exposure to pornography. Current content-filtering technology, based on blacklisting or simple keyword matching, is ineffective: today's filters produce many false positives and false negatives, and require tedious manual updating. This study examined how content filtering of pornographic Web page text, based on structural and statistical analysis, could greatly improve accuracy. Systematic differences between pornographic and non-pornographic Web pages were found, with Bayesian classification yielding 99.1% accuracy in text classification on pornographic and non-pornographic corpora.
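The Bayesian text classification the abstract refers to can be illustrated with a minimal multinomial naive Bayes sketch. This is not the study's implementation; the toy corpus, labels, and function names below are assumptions chosen purely for demonstration. Training estimates log class priors and Laplace-smoothed word likelihoods from labeled token lists; classification picks the label with the highest posterior score.

```python
# Illustrative naive Bayes text classifier (not the paper's actual code).
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (token_list, label). Returns a (priors, likelihoods, vocab) model."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)          # label -> word frequency table
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    # Log priors: P(label) estimated from document frequencies.
    priors = {lab: math.log(n / len(docs)) for lab, n in label_counts.items()}
    # Log likelihoods with add-one (Laplace) smoothing over the vocabulary.
    likelihoods = {}
    for lab, counts in word_counts.items():
        total = sum(counts.values())
        likelihoods[lab] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                            for w in vocab}
    return priors, likelihoods, vocab

def classify(model, tokens):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    priors, likelihoods, vocab = model
    scores = {}
    for lab, prior in priors.items():
        score = prior
        for w in tokens:
            if w in vocab:                      # unseen words are ignored
                score += likelihoods[lab][w]
        scores[lab] = score
    return max(scores, key=scores.get)

# Hypothetical two-class toy corpus for demonstration only.
docs = [
    (["hot", "adult", "content"], "porn"),
    (["free", "adult", "pics"], "porn"),
    (["stock", "market", "news"], "clean"),
    (["weather", "news", "today"], "clean"),
]
model = train(docs)
print(classify(model, ["adult", "pics"]))   # porn
print(classify(model, ["market", "news"]))  # clean
```

In practice a filter of this kind would pair the classifier with the preprocessing steps the references suggest, such as stemming and feature selection, before scoring page text.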
