BLN-Gram-TF-ITF as a new Feature for Authorship Identification

Authorship attribution is the task of determining which author wrote a given text, drawing on a corpus of texts written by one or more authors. By measuring and analyzing specific textual features, one can construct a profile that identifies a specific author. Identifying authors by their writing style is one of the most engaging problems for researchers in stylometry. The primary goal of our research is to identify the author of a text through analysis of stylistic traits. We present a new feature, Byte Level N-Gram Term Frequency Inverse Token Frequency (BLN-Gram-TF-ITF), that demonstrates initial success in identifying the correct authors of texts through analysis of those authors' text corpora. BLN-Gram-TF-ITF was implemented and applied to a set of text corpora and shows promising outcomes.
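To make the idea of a byte-level n-gram feature concrete, the following is a minimal sketch only. The abstract does not define the ITF term, so the weighting here is an assumption modeled on TF-IDF: the term frequency of each byte n-gram in a text is multiplied by the logarithm of the total n-gram count in the corpus divided by that n-gram's corpus count. The function names (`byte_ngrams`, `tf_itf`, `cosine`), the default n-gram length of 3, and the use of cosine similarity to compare profiles are all hypothetical choices for illustration, not the paper's confirmed method.

```python
import math
from collections import Counter


def byte_ngrams(text, n=3):
    """Return the overlapping n-grams of the UTF-8 bytes of `text`."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def tf_itf(text, corpus_counts, n=3):
    """Weight each byte n-gram of `text` by TF * ITF.

    ASSUMPTION: ITF is taken as log(total n-grams in corpus / corpus count
    of this n-gram), by analogy with IDF. The source does not give a formula.
    """
    grams = byte_ngrams(text, n)
    if not grams:
        return {}
    tf = Counter(grams)
    corpus_total = sum(corpus_counts.values())
    weights = {}
    for gram, count in tf.items():
        seen = corpus_counts.get(gram, 0)
        if seen == 0:
            continue  # n-gram never seen in the corpus: no ITF available
        weights[gram] = (count / len(grams)) * math.log(corpus_total / seen)
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Under these assumptions, one would build `corpus_counts` as a `Counter` over the byte n-grams of all known texts, compute one `tf_itf` profile per candidate author, and attribute an unknown text to the author whose profile has the highest `cosine` score against it.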
