The art of creating an informative data collection for automated deception detection: A corpus of truths and lies

One novel research direction in Natural Language Processing and Machine Learning involves developing methods for automatically distinguishing deceptive messages from truthful ones. Mistaking intentionally deceptive pieces of information for authentic ones (true to the writer’s beliefs) can have negative consequences, since our everyday decision-making, actions, and mood are often affected by the information we encounter. Such research is vital today as it aims to develop tools for the automated recognition of deceptive, disingenuous, or fake information (the kind intended to create false beliefs or conclusions in the reader’s mind). The ultimate goal is to support truthfulness ratings that signal the trustworthiness of retrieved information, or alert information seekers to potential deception. To proceed with this agenda, we require elicitation techniques for obtaining samples of both deceptive and truthful messages from study participants in various subject areas. A data collection, or a corpus of truths and lies, should meet certain basic criteria to allow for meaningful analysis and comparison of socio-linguistic behaviors. In this paper we propose solutions and weigh the pros and cons of various experimental set-ups in the art of corpus building. The outcomes of three experiments demonstrate certain limitations of using online crowdsourcing for data collection of this type. Incorporating motivation in the task descriptions and the role of visual context in creating deceptive narratives are other factors that should be addressed in future efforts to build a quality dataset.
