Improving the precision of the keyword-matching pornographic text filtering method using a hybrid model

As pornographic material floods the Internet, keeping people away from such offensive content has become one of the most important research areas in network information security. Applications that block or filter this content are already in use, and their approaches fall roughly into two kinds: metadata-based and content-based. As distributed technologies develop, content-based filtering will play an increasingly important role in filtering systems. Keyword matching is a content-based method widely used to filter harmful text. Experiments evaluating its recall and precision showed that recall is rather high but precision is unsatisfactory. Based on these results, a new pornographic text filtering model is proposed in which documents flagged by keyword matching are reconfirmed before being blocked. Experiments showed that the model is practical: it achieves higher precision than keyword matching alone, with little loss of recall.
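The two-stage idea can be illustrated with a minimal sketch. The abstract does not specify how the reconfirming stage works, so the version below is an assumption: it substitutes a toy rule that compares keyword hits against benign context words. The word lists, the function names (`keyword_match`, `hybrid_filter`, `precision_recall`), and the sample documents are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical word lists, chosen only for illustration.
KEYWORDS = {"xxx", "nude"}
BENIGN_CONTEXT = {"art", "museum", "painting", "medical"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_match(text):
    """Stage 1: flag the text if it contains any listed keyword."""
    return any(tok in KEYWORDS for tok in tokenize(text))


def hybrid_filter(text):
    """Stage 1 plus an assumed reconfirming stage.

    A text flagged by keyword matching is blocked only if its keyword
    hits outnumber its benign context words (a toy stand-in rule).
    """
    if not keyword_match(text):
        return False
    tokens = tokenize(text)
    keyword_hits = sum(tok in KEYWORDS for tok in tokens)
    benign_hits = sum(tok in BENIGN_CONTEXT for tok in tokens)
    return keyword_hits > benign_hits


def precision_recall(predictions, labels):
    """Return (precision, recall) for boolean predictions and labels."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum(not p and y for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Tiny made-up corpus: (text, should_be_blocked).
DOCS = [
    ("free xxx nude pics", True),
    ("nude painting at the art museum", False),
    ("weather report for today", False),
    ("xxx videos online", True),
]
LABELS = [label for _, label in DOCS]

keyword_only = precision_recall([keyword_match(t) for t, _ in DOCS], LABELS)
hybrid = precision_recall([hybrid_filter(t) for t, _ in DOCS], LABELS)
print("keyword only (precision, recall):", keyword_only)
print("hybrid       (precision, recall):", hybrid)
```

On this made-up corpus, keyword matching alone flags the museum sentence as a false positive, while the second stage rejects it. A real reconfirming stage would more plausibly be a trained text classifier, and the numbers here say nothing about the paper's reported results.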
