Applying automatic text-based detection of deceptive language to police reports: Extracting behavioral patterns from a multi-step classification model to understand how we lie to the police

VeriPol is an effective text-based lie detection model for police reports. Our model includes feature selection by L1 penalization and heuristic rules. Computational experiments on a real dataset show a validation accuracy of 91%. A pilot study shows a lower bound on the empirical precision of approximately 83%. The model analysis provides linguistic insights into how people lie to the police.

Filing a false police report is a crime that has dire consequences for both the individual and the system. For the individual, it may be charged as a misdemeanor or a felony. For society, a false report wastes police resources and contaminates the police databases used to carry out investigations and to assess the risk of crime in a territory. In this research, we present VeriPol, a model for the detection of false robbery reports based solely on their text. This tool, developed in collaboration with the Spanish National Police, combines Natural Language Processing and Machine Learning methods in a decision support system that provides police officers with the probability that a given report is false. VeriPol has been tested on more than 1000 reports from 2015 provided by the Spanish National Police. Empirical results show that it is highly effective in discriminating between false and true reports, with a success rate of more than 91%, which is more than 15% higher than the accuracy of expert police officers on the same dataset. The underlying classification model can be analysed to extract patterns and insights showing how people lie to the police (as well as how to get away with false reporting). In general, the more details provided in the report, the more likely it is to be honest. Finally, a pilot study carried out in June 2017 has demonstrated the usefulness of VeriPol in the field.
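To make the idea of feature selection by L1 penalization concrete, the following is a minimal sketch, not the VeriPol implementation. It assumes a plain bag-of-words representation and a logistic regression fitted by proximal gradient descent, neither of which is confirmed by the text above. The four reports are invented English toy examples (the real reports are from the Spanish National Police and are not reproduced here), and every function name and parameter value is hypothetical.

```python
import math
from collections import Counter

# Invented toy reports (NOT from the VeriPol dataset); label 1 = false report.
REPORTS = [
    ("two men grabbed my bag near the station at nine and ran north", 0),
    ("a tall man with a red jacket took my phone on the bus at noon", 0),
    ("someone took my phone from behind", 1),
    ("someone pushed me from behind and took my phone", 1),
]

def tokenize(text):
    return text.lower().split()

def build_vocab(docs):
    """Map every distinct token to a column index."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(doc, vocab):
    """Bag-of-words count vector; unknown tokens are ignored."""
    x = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(doc)).items():
        if tok in vocab:
            x[vocab[tok]] = float(count)
    return x

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, t):
    """Proximal operator of the L1 norm: shrinks small weights to exactly 0."""
    if value > t:
        return value - t
    if value < -t:
        return value + t
    return 0.0

def predict_proba(x, w, b):
    """Estimated probability that the report with features x is false."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_l1_logistic(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimise mean log-loss + lam * ||w||_1 by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n  # the intercept is not penalised
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

docs = [text for text, _ in REPORTS]
y = [label for _, label in REPORTS]
vocab = build_vocab(docs)
X = [vectorize(doc, vocab) for doc in docs]
w, b = fit_l1_logistic(X, y)

# Training-set probabilities and the tokens that kept a non-zero weight.
probs = [predict_proba(x, w, b) for x in X]
selected = {tok: round(w[i], 3) for tok, i in vocab.items() if w[i] != 0.0}
print(probs)
print(selected)
```

The soft-thresholding step is what turns the penalty into a feature selector: tokens whose evidence is too weak are assigned a weight of exactly zero and drop out of the model, while the surviving weights can be read as cues toward a true or a false report. Raising `lam` removes more tokens, and a sufficiently large value removes all of them.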
