Authorship Attribution in Russian in Real-World Forensics Scenario

Recent demands in authorship attribution, specifically cross-topic attribution with few training samples and very short texts, pose new challenges for corpus design and for feature and algorithm development. In this work we address these challenges by performing authorship attribution on a purpose-built dataset of short written texts in Russian in which both authorship and topic are controlled. We propose a pairwise classification design that closely resembles a real-world forensic task. Semantic coherence features are introduced to supplement well-established n-gram features in challenging cross-topic settings. Distance-based measures are compared with machine learning algorithms. The experimental results support the intuition that, for very small datasets, distance-based measures perform better than machine learning techniques. Moreover, pairwise classification results show that in difficult cross-topic cases, content-independent features, i.e., part-of-speech n-grams and semantic coherence, are promising. The results are supported by a feature significance analysis on the proposed dataset.
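The pairwise, distance-based setup mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of character trigrams, cosine distance, and the decision threshold of 0.5 are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Cosine distance between two n-gram count profiles (0 = same direction)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text shorter than n characters has an empty profile;
        # treat it as maximally distant rather than dividing by zero.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def same_author(text_a, text_b, threshold=0.5, n=3):
    """Pairwise decision: True if the two profiles are closer than the threshold.

    The threshold is an illustrative assumption; in practice it would be
    tuned on held-out text pairs.
    """
    return cosine_distance(char_ngrams(text_a, n), char_ngrams(text_b, n)) < threshold
```

A distance measure of this kind has no parameters to fit apart from the threshold, which is one plausible reason such measures can remain usable when only a handful of short texts per author is available.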
