Detecting sexual predators in chats using behavioral features and imbalanced learning

Abstract This paper presents a system for detecting sexual predators in online chat conversations using two-stage classification and behavioral features. A sexual predator is defined as a person who tries to obtain sexual favors in a predatory manner, usually from underage people. The proposed approach combines several text categorization methods with empirical behavioral features developed specifically for this task. After investigating various approaches to the sexual predator identification problem, we found that a two-stage classifier achieves the best results. In the first stage, a Support Vector Machine classifier distinguishes conversations with suspicious content from safe online discussions. Because most real-life chat conversations do not involve a sexual predator, this stage acts as a filter, so that the actual detection of predators is performed only on suspicious chats that are very likely to contain one. In the second stage, a Random Forest classifier determines which of the users in a suspicious discussion is an actual predator. The system was tested on the corpus provided by the PAN 2012 workshop organizers, and the results are encouraging: as far as we know, our solution outperforms all previous approaches developed for this task.
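The control flow of the two-stage approach described above can be sketched as follows. This is only an illustrative outline under stated assumptions, not the system evaluated in the paper: the function and type names (`detect_predators`, `Conversation`, `toy_is_suspicious`, `toy_is_predator`) are hypothetical, and simple keyword rules stand in for the trained Support Vector Machine and Random Forest models, whose features and parameters are not given in the abstract.

```python
from typing import Callable, Dict, List, Set

# Assumed representation: a conversation maps each author id
# to the list of messages that author wrote.
Conversation = Dict[str, List[str]]


def detect_predators(
    conversations: List[Conversation],
    is_suspicious: Callable[[Conversation], bool],
    is_predator: Callable[[str, Conversation], bool],
) -> Set[str]:
    """Two-stage cascade: filter conversations, then classify their authors."""
    flagged: Set[str] = set()
    for conv in conversations:
        # Stage 1: skip conversations judged safe
        # (the paper uses a Support Vector Machine for this decision).
        if not is_suspicious(conv):
            continue
        # Stage 2: label each participant of a suspicious conversation
        # (the paper uses a Random Forest for this decision).
        for author in conv:
            if is_predator(author, conv):
                flagged.add(author)
    return flagged


# Placeholder models, NOT the paper's classifiers: keyword rules
# stand in for the two trained models so the sketch is runnable.
CUES = {"secret", "alone"}


def toy_is_suspicious(conv: Conversation) -> bool:
    return any(
        cue in msg.lower()
        for msgs in conv.values()
        for msg in msgs
        for cue in CUES
    )


def toy_is_predator(author: str, conv: Conversation) -> bool:
    return any(cue in msg.lower() for msg in conv[author] for cue in CUES)


chats = [
    {"u1": ["did you finish the homework?"], "u2": ["yes, almost"]},
    {"u3": ["are you home alone?", "keep this a secret"], "u4": ["why?"]},
]
print(detect_predators(chats, toy_is_suspicious, toy_is_predator))
```

Passing the two decision functions as arguments keeps the cascade independent of the models behind them, so trained classifiers could replace the keyword rules without changing the loop. The second stage only ever sees participants of conversations that passed the first stage, which is the filtering effect the abstract describes.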
