The Identification of Pornographic Sentences in Bahasa Indonesia

Abstract Positive and negative content are mixed on the Internet. The government of Indonesia recognizes negative content as a potential threat to Internet users and has launched several services in response, such as DNS Nawala and the TRUST+™ Positif database. However, these measures are not sufficient, because validating the TRUST+™ Positif database requires considerable human resources. This research is a first step toward identifying negative content on a web page. It provides the core system for determining whether a sentence is pornographic or non-pornographic. The research begins with corpus building, continues with model training, and ends with testing. The corpus is downloaded from pornographic websites listed in the TRUST+™ Positif database. We tested the identification process using K-Nearest Neighbor (KNN), the Passive Aggressive Classifier, and Support Vector Machine (SVM). Both the Passive Aggressive Classifier and SVM perform very well, whereas KNN yields a mediocre result. The SVM algorithm achieves the highest accuracy, 98.25%.
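To make the train-then-test flow concrete, the following is a minimal sketch of a sentence-level binary classifier using the basic passive-aggressive update rule on bag-of-words counts. It is an illustration only, not the implementation evaluated in this research: the feature representation, the function names (`featurize`, `train_passive_aggressive`, `predict`), the number of epochs, and the toy sentences are all assumptions, and the tokens `flaggedterm1` and `flaggedterm2` are hypothetical placeholders standing in for vocabulary from the pornographic corpus.

```python
from collections import Counter

PORN, NON_PORN = 1, -1  # label encoding assumed for this sketch


def featurize(sentence):
    """Lowercase, split on whitespace, and return bag-of-words counts."""
    return Counter(sentence.lower().split())


def dot(weights, features):
    """Sparse dot product between the weight dict and a feature Counter."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in features.items())


def train_passive_aggressive(samples, epochs=5):
    """Train with the basic passive-aggressive rule (no aggressiveness cap).

    samples: list of (sentence, label) pairs, label being PORN or NON_PORN.
    """
    weights = {}
    for _ in range(epochs):
        for sentence, label in samples:
            x = featurize(sentence)
            # Hinge loss: zero when the sample is classified with margin >= 1.
            loss = max(0.0, 1.0 - label * dot(weights, x))
            norm_sq = sum(cnt * cnt for cnt in x.values())
            if loss == 0.0 or norm_sq == 0:
                continue  # "passive": leave the weights unchanged
            # "Aggressive": smallest step that brings the margin up to 1.
            tau = loss / norm_sq
            for tok, cnt in x.items():
                weights[tok] = weights.get(tok, 0.0) + tau * label * cnt
    return weights


def predict(weights, sentence):
    """Return PORN for a positive score; unseen-only input defaults to NON_PORN."""
    return PORN if dot(weights, featurize(sentence)) > 0 else NON_PORN


# Toy corpus with hypothetical placeholder tokens (not real data).
training_data = [
    ("flaggedterm1 flaggedterm2 video", PORN),
    ("berita cuaca hari ini", NON_PORN),
    ("flaggedterm1 foto gratis", PORN),
    ("resep masakan rumah", NON_PORN),
]

weights = train_passive_aggressive(training_data)
print(predict(weights, "flaggedterm2 gratis"))
print(predict(weights, "cuaca rumah"))
```

In this sketch a sentence made up only of unseen words scores zero and is labeled non-pornographic; that default is a design choice of the example, not something specified in the abstract.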