A Complex Network Approach to Stylometry

Statistical methods have been widely employed to study the fundamental properties of language. In recent years, methods from complex and dynamical systems have proved useful for creating several language models. Despite the large number of studies devoted to representing texts with physical models, only a few have shown how the properties of the underlying physical systems can be employed to improve performance in natural language processing tasks. In this paper, I address this problem by devising complex network methods that improve the performance of current statistical methods. Using a fuzzy classification strategy, I show that the topological properties extracted from texts complement the traditional textual description. In several cases, hybrid approaches outperformed both traditional and network-based methods used alone. Because the proposed model is generic, the framework devised here could be straightforwardly applied to similar textual applications in which topology plays a pivotal role in the description of the interacting agents.
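
To make the idea of a hybrid description concrete, the following minimal sketch combines a traditional feature (relative word frequency) with two topological features (degree and clustering coefficient) for each word of interest. The abstract does not specify how the networks are built or which measurements are used, so the word adjacency construction, the choice of measurements, and the names `adjacency_network`, `clustering`, and `hybrid_features` are all illustrative assumptions rather than the method of the paper.

```python
from collections import Counter, defaultdict


def adjacency_network(tokens):
    """Undirected word adjacency network (an assumed construction):
    two words are linked if they appear side by side in the text."""
    neighbors = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # ignore self-loops from repeated words
            neighbors[left].add(right)
            neighbors[right].add(left)
    return neighbors


def clustering(neighbors, word):
    """Fraction of pairs of a word's neighbors that are themselves linked."""
    nbrs = list(neighbors[word])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbors[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def hybrid_features(text, words):
    """For each word: [relative frequency, degree, clustering coefficient].
    Words absent from the text contribute zeros."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    net = adjacency_network(tokens)
    total = len(tokens) or 1
    features = []
    for w in words:
        present = w in net
        features.append(counts[w] / total)                      # traditional
        features.append(float(len(net[w])) if present else 0.0)  # topological
        features.append(clustering(net, w) if present else 0.0)  # topological
    return features
```

A vector produced this way for a fixed list of frequent words could then be passed to any classifier, fuzzy or otherwise; the classification step itself is left out of the sketch.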
