Using complex networks for text classification: Discriminating informative and imaginative documents

Statistical methods have been widely employed in recent years to capture many properties of language. The application of such techniques has led to improvements in several language applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantic content of texts, as is the case with bag-of-words language models. These approaches have yielded reasonable performance. However, other potentially useful features, such as the structural organization of texts, have been used in only a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aimed at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than that obtained with similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.
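
To make the idea of describing function words by their local network properties more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain word adjacency network built from whitespace tokens, a small hypothetical list of function words (`FUNCTION_WORDS`), and two simple stand-in measurements, degree and clustering coefficient, in place of the symmetry and accessibility measurements mentioned above. The resulting vector could then be passed to any supervised classifier.

```python
from collections import defaultdict

# Hypothetical, deliberately short list of function words (an assumption;
# the abstract does not specify which words were used).
FUNCTION_WORDS = ("the", "of", "and", "a", "in", "to")


def build_adjacency_network(tokens):
    """Link every pair of adjacent, distinct tokens with an undirected edge."""
    neighbours = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            neighbours[left].add(right)
            neighbours[right].add(left)
    return neighbours


def local_features(neighbours, word):
    """Return (degree, clustering coefficient) for one word.

    These are stand-in local measurements chosen for simplicity; they are
    not the symmetry or accessibility measurements named in the abstract.
    """
    adjacent = sorted(neighbours.get(word, ()))
    degree = len(adjacent)
    if degree < 2:
        return degree, 0.0
    # Count edges that connect two neighbours of the word.
    links = sum(
        1
        for i, u in enumerate(adjacent)
        for v in adjacent[i + 1:]
        if v in neighbours[u]
    )
    return degree, 2.0 * links / (degree * (degree - 1))


def feature_vector(text):
    """Concatenate the local features of each function word, in a fixed order."""
    tokens = text.lower().split()
    network = build_adjacency_network(tokens)
    vector = []
    for word in FUNCTION_WORDS:
        vector.extend(local_features(network, word))
    return vector


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat in the sun"
    print(feature_vector(sample))
```

Each document yields a vector of fixed length (two values per function word), so documents of different sizes can be compared directly; a function word that is absent from a text contributes zeros.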
