Authorship Attribution of Internet Comments with Thousand Candidate Authors

In this paper we report the first authorship attribution results for the Lithuanian language using Internet comments with a thousand candidate authors. The task is complicated for three reasons: the large number of candidate authors, the extremely short, non-normative texts, and the problems associated with a language that is rich in both morphology and vocabulary.
