System of negative Indonesian website detection using TF-IDF and Vector Space Model

Several researchers have developed systems to filter negative (pornographic) websites, but those systems target English-language websites. An existing filter for negative Indonesian websites relies on a URL database rather than on page content. This research developed a filter for negative Indonesian websites that is based on content filtering using TF-IDF (Term Frequency-Inverse Document Frequency) and the VSM (Vector Space Model). The system achieves a classification accuracy of 82.80%.
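
To make the approach concrete, the following is a minimal Python sketch of content filtering with TF-IDF weights and cosine similarity in a vector space model. It is not the authors' implementation: the toy training documents, the labels, the smoothed IDF formula, the whitespace tokenizer, and the nearest-document decision rule are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization, no stemming.
    return text.lower().split()

def build_idf(docs_tokens):
    # Document frequency: number of documents that contain each term.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    # Assumption: log(N / df) + 1, so terms found in every document keep some weight.
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    # Sparse vector: raw term frequency times IDF; unseen terms are dropped.
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(text, train_vectors, idf):
    # Assumption: label the page with the class of its most similar training document.
    query = tfidf_vector(tokenize(text), idf)
    best_label, best_score = None, 0.0
    for vector, label in train_vectors:
        score = cosine(query, vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Hypothetical toy corpus (placeholder Indonesian phrases, not the paper's data).
corpus = [
    ("konten dewasa eksplisit", "negative"),
    ("video dewasa eksplisit", "negative"),
    ("berita olahraga sepak bola", "normal"),
    ("resep masakan ayam goreng", "normal"),
]
tokenized = [tokenize(text) for text, _ in corpus]
idf = build_idf(tokenized)
train_vectors = [
    (tfidf_vector(tokens, idf), label)
    for tokens, (_, label) in zip(tokenized, corpus)
]

label, score = classify("situs konten dewasa", train_vectors, idf)
print(label, round(score, 3))
```

In this sketch a page whose terms never occur in the training corpus receives the label `None` with a score of 0.0, so a caller would need to decide how to treat unmatched pages. A real filter would also need a much larger labeled corpus, plus Indonesian stop-word removal and stemming, none of which is shown here.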
