File Name Classification Approach to Identify Child Sexual Abuse

When Law Enforcement Agencies seize a computer from a potential producer or consumer of Child Sexual Exploitation Material (CSEM), they need accurate and time-efficient tools to analyze its files. However, detecting and classifying CSEM by manual inspection is highly time-consuming and is usually infeasible within the time available to Spanish police acting under a search warrant. One option for identifying CSEM is to analyze the names of the files stored on the suspect's hard disk, looking for text patterns related to CSEM. However, because of the particular nature of these file names, mainly their length and the use of obfuscated words, current file name classification methods suffer from low recall, a metric that is essential in the context of this problem. This paper presents our ongoing research on identifying CSEM through file names. We evaluate two approaches to short text classification: one based on machine learning classifiers, exploring Logistic Regression and Support Vector Machines, and one based on deep learning, adapting two popular Convolutional Neural Network (CNN) models that operate at the character level. The presented CNN achieved an average class recall of 0.86 and a recall of 0.78 for the CSEM class. The CNN-based classifier could be integrated into forensic tools and services that support Law Enforcement Agencies in identifying CSEM without the need to systematically access the visual content of every file.
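The two reported figures measure different things, and a small worked example may help separate them. The sketch below assumes that "average class recall" means the unweighted mean of the per-class recalls (often called macro-averaged recall); the abstract does not define the term. The label names and predictions are invented toy values for illustration only, not data or results from the paper.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall}, where recall = correctly predicted / actual members of the class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def average_class_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls (assumed meaning of 'average class recall')."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical toy labels: 4 items of the class of interest, 6 of the other class.
y_true = ["positive"] * 4 + ["negative"] * 6
y_pred = ["positive", "positive", "positive", "negative",   # one positive item missed
          "negative", "negative", "negative", "negative",
          "negative", "positive"]                            # one false alarm

recalls = per_class_recall(y_true, y_pred)
average = average_class_recall(y_true, y_pred)
print(recalls)
print(average)
```

Under this reading, the recall for a single class can sit well below the average when the other class is easier to recognize, which is consistent with the gap between the two figures quoted above.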
