Paper Information - Recognizing Predatory Chat Documents using Semi-supervised Anomaly Detection

Recognizing Predatory Chat Documents using Semi-supervised Anomaly Detection

Chat logs are informative documents available to today's social network providers. Providers and law enforcement tend to use these huge logs anonymously for automatic online Sexual Predator Identification (SPI), a relatively new application area. The task plays an important role in protecting children and juveniles against exploitation by online predators. Pattern recognition techniques help law enforcement automatically identify harmful conversations in cyberspace. These techniques usually require a large volume of high-quality training instances of both predatory and non-predatory documents. However, collecting non-predatory documents is not practical in real-world applications, since this category contains a large variety of documents on many topics, including politics, sports, science and technology. We mitigate this problem with a new semi-supervised approach that adapts an anomaly detection technique, the One-class Support Vector Machine, which does not require non-predatory samples for training. We compared the performance of this approach against other state-of-the-art methods that use both positive and negative instances. We observed that although the anomaly detection approach uses only one class label for training (a very desirable property in practice), its performance is comparable to that of binary SVM classification. In addition, this approach outperforms the classic two-class Naïve Bayes algorithm, which we used as our baseline, in terms of both classification accuracy and precision.

Introduction

During the past decade, automated online Sexual Predator Identification from chat documents has boomed, driven by pattern recognition techniques capable of flagging likely predators for the attention of law enforcement.
The most common approach was presented in the PAN-2012 international competition [1], which was specifically designed around the following two tasks [2]:

1. Finding the predators vs. victims
2. Finding the predatory messages in a predatory document

The first task seems to be more important for law enforcement, since it can help them narrow their search space drastically. It is worth mentioning that the second task has not been as successful as the first one because it requires deeper natural language analysis. The first task can be performed in two steps [3]:

1. Identifying the predatory documents in the entire conversation corpus
2. Searching among the participants of predatory documents in order to distinguish the sexual predator from the victim

In this paper we focus on the first step mentioned above (i.e. identifying the predatory conversations), since it is the area most likely to help investigators in real-world applications. Accordingly, the main motivation behind using One-class SVM on this kind of data, and treating the problem as an anomaly detection problem, is to build a classifier that can learn from only one class label instead of the two required in traditional binary classification. Figure 1 depicts the different granularity levels for designing classifiers in online sexual predator identification.

Figure 1. Classification granularity levels and their corresponding classification problems in SPI

Section 2 describes the current status of SPI, section 3 explains the proposed approach, which is based on semi-supervised anomaly detection, and section 4 dissects the document recognition process we conducted on the SPI problem, including pre-processing, feature extraction and pattern classification. Section 4 also reports the results of comparing the different methods.
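For readers unfamiliar with the technique, the following is the standard one-class SVM formulation as commonly stated in the literature (Schölkopf et al.); it is given here as background and is not taken from this paper. The symbols are assumptions for illustration: $x_1, \dots, x_n$ are the feature vectors of the training documents (all from the single available class), $\Phi$ is the kernel feature map, and $\nu \in (0, 1]$ is the user-chosen trade-off parameter.

```latex
% Primal problem: separate the training data from the origin with maximum margin
\min_{w,\ \xi,\ \rho} \quad \frac{1}{2}\lVert w \rVert^{2}
  + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho
\qquad \text{subject to} \quad
\langle w, \Phi(x_i) \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n

% Decision function: +1 for documents resembling the training class, -1 otherwise
f(x) = \operatorname{sgn}\bigl( \langle w, \Phi(x) \rangle - \rho \bigr)
```

In this formulation $\nu$ is an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of support vectors, which is why no samples from the second class are needed to fit the model.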
Motivation

According to researchers who participated in PAN-2012, the data set has a major weakness: the non-predatory, non-sexual samples were gathered exclusively from publicly available IRC logs, which mainly contain chats about computer and web technologies and therefore cannot represent “general conversations” [4]. Samples in the general conversation category (which are also non-predatory) must cover countless topics such as sports, music, games and computers. In practice, it is not easy to assemble such a training data set. As a result, the current top-ranked algorithms in PAN-2012 may have learned to distinguish computer-related chats from sexual chats rather than to identify actual predatory chats online. Accordingly, one can expect that their performance will decrease in real-world applications. In other words, we believe that although the top-ranked algorithms in PAN-2012 achieved a high F1-score on the test data set (87% for the winner), they require general samples that represent the non-predatory data properly, so their performance will decrease significantly in practical environments such as law enforcement.

©2016 Society for Imaging Science and Technology. IS&T International Symposium on Electronic Imaging 2016, Document Recognition and Retrieval XXIII, DRR-063.1

In this work, we propose a novel way to handle this problem by eliminating the need for both class labels in the training data set. Because one of the class labels is absent from the training process, our method is more practical, at the expense of a lower but still acceptable F1-score. Using only one class label in the training process makes this approach a semi-supervised classification method. Furthermore, to validate the effectiveness of our approach, we aim to beat the baseline (the naïve Bayes algorithm) in terms of F1-score.
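Since the comparison above is stated in terms of precision and F1-score, the following minimal Python sketch shows how these metrics can be computed from predicted and true labels. It is an illustration only, not the authors' evaluation code; the names `Counts`, `count_outcomes` and `f_beta`, the label encoding (1 = predatory, 0 = non-predatory) and the toy labels are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Outcome counts for the positive (here: predatory) class."""
    tp: int  # positive documents flagged as positive
    fp: int  # negative documents flagged as positive
    fn: int  # positive documents that were missed


def count_outcomes(y_true, y_pred, positive=1):
    """Tally true positives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return Counts(tp, fp, fn)


def precision(c):
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c):
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f_beta(c, beta=1.0):
    """F-measure; beta=1 gives the F1-score, beta<1 weights precision more."""
    p, r = precision(c), recall(c)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # Hypothetical labels for ten documents (assumed encoding: 1 = predatory).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    c = count_outcomes(y_true, y_pred)
    print(precision(c), recall(c), f_beta(c))  # 0.75 0.75 0.75
```

The `beta` parameter is included because a competition may rank systems with a weighted F-measure rather than plain F1; the value to use is left to the caller.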
Note that each chat conversation represents a document in our recognition process; hence, in the remainder of this paper we use “document” and “conversation” interchangeably.

Related Work

Perhaps the first successful attempt to apply machine learning to the SPI problem was made by Pendar, who used a weighted K-NN classifier to distinguish predators from underage victims [5]. To the best of our knowledge, the first empirical system capable of identifying predatory messages in chat logs is ChatCoder1 (and ChatCoder2), implemented and evolved by Kontostathis and her colleagues [6] [7]. The system uses a rule-based approach in conjunction with decision trees and instance-based learning methods (K-NN). It is worth mentioning that [8] had already introduced a general approach, not specific to SPI, that uses a weighted version of the K-NN algorithm to mitigate the problem of imbalanced data in text categorization.

Recently, the PAN-2012 conference has boosted the application of machine learning techniques to this area. The main strength of this conference is that it provided the first publicly available official data set specifically engineered for the sexual predator identification task. Researchers tuned their proposed methods against the same training data and reported their performance on the test data. Several machine learning algorithms were used to solve the SPI problem in this competition, covering a wide range of classification algorithms such as maximum entropy-based classification [9], K-NN [10], Support Vector Machine [4] and Neural Networks [3]. Eventually, one team was announced as the winner based on its classification accuracy and an augmented F-measure. The winning team [3] used a two-step binary classification approach, called SCI (Suspicious Conversation Identification) and VFP (Victim From Predator Disclosure), using SVM and Neural Networks.
Accordingly, we use SVM as the state-of-the-art method against which to compare the performance of our anomaly detection approach. Escalante and his colleagues [11] proposed a new method based on learning a chain of three local classifiers corresponding to three segments of each document (i.e. conversation), but the approach could not outperform the PAN-2012 winner. Related research on cyberbullying, which is very close to predator identification, has been done by Kontostathis [12]. It uses a different supervised learning algorithm based on Latent Semantic Indexing, called Essential Dimensions of LSI, to identify cyberbullying. The authors built their own data set using Formspring.me, a popular question-and-answer website. In the most recent work, [13] proposed enriching the traditional bag-of-words language model with other feature types, including sentiment features, psycho-linguistic features and discourse patterns. They then used binary classification for the actual predator identification task. Generally, the algorithms used in PAN-2012 can be considered the state of the art in sexual predator identification.

With regard to anomaly detection, there is a wide variety of unsupervised, supervised and semi-supervised models. A comprehensive survey of anomaly detection is given in [14]. The authors categorize anomaly detection methods into six major categories: clustering-based, classification-based, nearest-neighbor-based (which also includes density-based methods), statistical, information-theoretic and spectral methods. We use a slightly different taxonomy, organized by learning method, to show the place of the method we use. We do not describe the different methods and foundations of anomaly detection, since that is beyond the scope of this article. Instead, we focus on the specific anomaly detection method (i.e. one-class SVM) that yielded the desired results in this application domain. Figure 2 illustrates the taxonomy of the most common anomaly detection techniques as well as the position of semi-supervised techniques.

Figure 2. Position of semi-supervised and SVM-based techniques in the taxonomy of anomaly detection techniques

One-class SVM is highlighted in the figure. For the sake of completeness, the unsupervised SVM-based algorithms are shown as well. The corresponding leaf nodes of the taxonomy will

[1] Nancy Chinchor, et al. MUC-4 evaluation metrics, 1992, MUC.

[2] Bernhard Schölkopf, et al. Support Vector Method for Novelty Detection, 1999, NIPS.

[3] Bernhard Schölkopf, et al. New Support Vector Algorithms, 2000, Neural Computation.

[4] Bernhard Schölkopf, et al. Estimating the Support of a High-Dimensional Distribution, 2001, Neural Computation.

[5] Songbo Tan, et al. Neighbor-weighted K-nearest neighbor for unbalanced text corpus, 2005, Expert Syst. Appl.

[6] Ingo Mierswa, et al. YALE: rapid prototyping for complex data mining tasks, 2006, KDD '06.

[7] Alexander Zien, et al. A continuation method for semi-supervised SVMs, 2006, ICML.

[8] Nick Pendar, et al. Toward Spotting the Pedophile Telling victim from predator in text chats, 2007, International Conference on Semantic Computing (ICSC 2007).

[9] Lynne Edwards, et al. ChatCoder: Toward the Tracking and Categorization of Internet Predators, 2009.

[10] Varun Chandola, et al. Anomaly detection: A survey, 2009, CSUR.

[11] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[12] April Kontostathis, et al. Learning to Identify Internet Sexual Predation, 2011, Int. J. Electron. Commer.

[13] Hugo Jair Escalante, et al. A Two-step Approach for Effective Detection of Misbehaving Users in Chats, 2012, CLEF.

[14] Seung-Hoon Na, et al. IR-based k-Nearest Neighbor Approach for Identifying Abnormal Chat Users, 2012, CLEF.

[15] Fabio Crestani, et al. Overview of the International Sexual Predator Identification Competition at PAN-2012, 2012, CLEF.

[16] Cristian Grozea, et al. Kernel Methods and String Kernels for Authorship Analysis, 2012, CLEF.

[17] Graeme Hirst, et al. Identifying Sexual Predators by SVM Classification with Lexical and Behavioral Features, 2012, CLEF.

[18] Amit P. Sheth, et al. Topical anomaly detection from Twitter stream, 2012, WebSci '12.

[19] Gunnar Eriksson, et al. Features for Modelling Characteristics of Conversations, 2012, CLEF.

[20] Barbara Poblete, et al. On-line relevant anomaly detection in the Twitter stream: an efficient bursty keyword detection model, 2013, ODD '13.

[21] Hugo Jair Escalante, et al. Sexual predator detection in chats with chained classifiers, 2013, WASSA@NAACL-HLT.

[22] Slim Abdennadher, et al. Enhancing one-class support vector machines for unsupervised anomaly detection, 2013, ODD '13.

[23] Marius Kloft, et al. Toward Supervised Anomaly Detection, 2014, J. Artif. Intell. Res.

[24] Kelly Reynolds, et al. Detecting cyberbullying: query terms and techniques, 2013, WebSci.

[25] Harith Alani, et al. Detecting Child Grooming Behaviour Patterns on Social Media, 2014, SocInfo.

[26] Sriraam Natarajan, et al. Anomaly Detection in Text: The Value of Domain Knowledge, 2015, FLAIRS.