Authorship attribution based on Life-Like Network Automata

Authorship attribution is a problem of considerable practical and technical interest. Several methods have been designed to infer the authorship of disputed documents in multiple contexts. Traditional statistical methods based solely on word counts and related measurements provide a simple yet effective solution in particular cases, but they are prone to manipulation. Recently, texts have been successfully modeled as networks, where words are represented by nodes linked according to textual similarity measurements. Such models are useful for identifying informative topological patterns for the authorship recognition task. However, there is no consensus on which measurements should be used. We therefore propose a novel method to characterize text networks that considers both their topological and dynamical aspects. Using concepts and methods from cellular automata theory, we devised a strategy to capture informative spatio-temporal patterns from this model. In our experiments, this approach outperformed structural analysis relying only on topological measurements, such as clustering coefficient, betweenness and shortest paths. The optimized results obtained here pave the way for a better characterization of textual networks.
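To make the idea of running a cellular automaton on a text network concrete, here is a minimal sketch, not the method evaluated in the paper. It rests on several assumptions that the abstract does not confirm: words are linked when they appear next to each other in the text, each node is either alive (1) or dead (0), and the birth and survival rules are closed intervals on the fraction of alive neighbors. All function names and parameters (`build_word_network`, `step`, `evolve`, `alive_fraction`, `born`, `survive`) are hypothetical.

```python
from collections import defaultdict


def build_word_network(tokens):
    """Undirected network linking consecutive words (assumed linking rule)."""
    neighbors = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # skip self-loops from repeated words
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


def step(neighbors, state, born, survive):
    """One synchronous update based on the fraction of alive neighbors.

    born and survive are (low, high) closed intervals (an assumption):
    a dead node becomes alive if its density falls inside born,
    an alive node stays alive if its density falls inside survive.
    """
    new_state = {}
    for node, adj in neighbors.items():
        density = sum(state[n] for n in adj) / len(adj)
        low, high = survive if state[node] else born
        new_state[node] = 1 if low <= density <= high else 0
    return new_state


def evolve(neighbors, state, born, survive, steps):
    """Return the list of states, starting with the initial one."""
    history = [state]
    for _ in range(steps):
        state = step(neighbors, state, born, survive)
        history.append(state)
    return history


def alive_fraction(history):
    """A simple temporal descriptor: share of alive nodes at each step."""
    return [sum(s.values()) / len(s) for s in history]


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the log"
    net = build_word_network(text.split())
    # Hypothetical initial condition: only the word "the" starts alive.
    initial = {word: int(word == "the") for word in net}
    history = evolve(net, initial, born=(0.4, 1.0), survive=(0.2, 0.6), steps=5)
    print(alive_fraction(history))
```

The sequence returned by `alive_fraction` is one possible spatio-temporal descriptor; in a classification setting, such sequences computed for texts of known authorship could serve as feature vectors alongside, or instead of, purely topological measurements.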
