Paragraph-based representation of texts: A complex networks approach

Abstract An interesting model to represent texts as graphs (also called networks) is the word adjacency (co-occurrence) representation, which is known to capture mainly syntactic features of texts. In this study, we propose a novel network model based on the similarity between the content of the paragraphs of a text. Using this representation, we characterized the networks with measurements developed in network science. We evaluated these measurements by their ability to discriminate between real and shuffled texts and to capture information about the content similarity of chunks of text. To compare the results with a more sophisticated approach, we employed a methodology based on word2vec. When comparing real and shuffled texts, the results revealed that real texts tend to have a better-defined community structure, a characteristic that can be related to the organization of subjects in real texts. The network-based measurements found to discriminate real from shuffled texts were used as features in a classifier, which achieved an accuracy of 98.72%. For comparison with a different methodology, we used doc2vec-based features in the classifier, which yielded an accuracy of 70.8%. Finally, the proposed network-based features were employed to analyze the Voynich manuscript, which was found to be compatible with real texts according to the considered characteristics.
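
The idea of linking paragraphs by content similarity can be illustrated with a minimal sketch. The abstract does not specify how similarity is measured or how edges are selected, so everything below is an assumption made for illustration only: the tokenizer, the tf-idf weighting, the cosine similarity, the threshold value, and the function names (tokenize, tfidf_vectors, cosine, paragraph_network) are hypothetical and are not taken from the study.

```python
import math
import re
from collections import Counter


def tokenize(paragraph):
    """Lowercase a paragraph and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", paragraph.lower())


def tfidf_vectors(paragraphs):
    """Return one sparse tf-idf vector (dict: word -> weight) per paragraph."""
    counts = [Counter(tokenize(p)) for p in paragraphs]
    n = len(paragraphs)
    # Number of paragraphs in which each word occurs at least once.
    doc_freq = Counter(word for c in counts for word in c)
    return [
        {word: tf * math.log(n / doc_freq[word]) for word, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def paragraph_network(paragraphs, threshold=0.1):
    """Build a weighted, undirected network with one node per paragraph.

    Two paragraphs are linked when their similarity exceeds `threshold`.
    The threshold is an illustrative assumption, not a value from the study.
    Returns a dict mapping (i, j) with i < j to the edge weight.
    """
    vectors = tfidf_vectors(paragraphs)
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            similarity = cosine(vectors[i], vectors[j])
            if similarity > threshold:
                edges[(i, j)] = similarity
    return edges


# Toy input: two paragraphs about networks, two about a manuscript.
example_paragraphs = [
    "Networks have nodes and edges.",
    "Edges connect nodes in networks.",
    "The manuscript contains drawings of plants.",
    "Drawings of plants fill the manuscript.",
]
example_edges = paragraph_network(example_paragraphs)
```

In this toy input, paragraphs on the same subject share vocabulary and become linked, while paragraphs on different subjects share no words and stay unconnected. Groups of this kind are what a community detection method would look for in such a network.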
