Predicting Authorship and Author Traits from Keystroke Dynamics

Written text transmits a good deal of nonverbal information related to the author’s identity and social factors, such as age, gender and personality. However, less is known about the extent to which behavioral biometric traces transmit such information. We use typist data to study how well authorship can be predicted, and present first experiments on predicting both age and gender from keystroke dynamics. Our results show that the model based on keystroke features, while two orders of magnitude smaller, yields significantly higher accuracy for authorship than the text-based system. For user attribute prediction, the best approach is to combine the two, suggesting that extralinguistic factors are disclosed to a larger degree in written text, while author identity is better transmitted in typing behavior.
