Representation of texts as complex networks: a mesoscopic approach

Statistical techniques for analyzing texts, collectively known as text analytics, have moved beyond simple word-count statistics toward a new paradigm. Text mining now relies on a more sophisticated set of methods, including representations based on complex networks. While well-established word-adjacency (co-occurrence) methods successfully capture syntactic features of written texts, they cannot represent other important aspects of textual data, such as topical structure, i.e. the sequence of subjects that develops at a mesoscopic level along the text. Current methodologies often overlook such aspects. To capture the mesoscopic characteristics of semantic content in written texts, we devised a network model that analyzes documents in a multi-scale fashion. In the proposed model, each node represents a limited number of adjacent paragraphs, and two nodes are connected whenever they share a minimum amount of semantic content. To illustrate the capabilities of the model, we present, as a case study, a qualitative analysis of "Alice's Adventures in Wonderland". We show that the mesoscopic structure of a document, modeled as a network, reveals many semantic traits of texts. Such an approach paves the way for a range of semantics-based applications. In addition, we illustrate the approach in a machine learning context, in which texts are classified as either real texts or randomized instances.
