Using complex networks to quantify consistency in the use of words

In this paper we quantify the consistency of word usage in written texts represented as complex networks, in which words are nodes, by measuring how well the neighborhood of each node is preserved. A word was considered highly consistent if authors used it with the same neighborhood. When ranked by consistency of use, the words followed a log-normal distribution, in contrast to Zipf's law, which applies to the frequency of use. Consistency correlated positively with familiarity and frequency of use, and negatively with ambiguity and age of acquisition. Inspection of some highly consistent words confirmed that they are used in very limited semantic contexts. A comparison of consistency indices for eight authors indicated that these indices may be employed for author recognition; as expected, authors of novels could be distinguished from those who wrote scientific texts. Our analysis demonstrates the suitability of the consistency indices, which can now be applied to other tasks, such as emotion recognition.
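To make the idea of neighborhood preservation concrete, the following is a minimal sketch and not the measure used in the paper. It assumes a word adjacency network in which consecutive words are linked, and it scores a word by the Jaccard overlap of its neighbor sets in two texts. The adjacency rule, the Jaccard score, and the function names `neighborhoods` and `neighborhood_overlap` are all illustrative assumptions.

```python
from collections import defaultdict


def neighborhoods(tokens):
    """Map each word to the set of words adjacent to it in the token list.

    Assumption: an undirected edge links every pair of consecutive,
    distinct tokens (a simple word adjacency network).
    """
    nbrs = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs


def neighborhood_overlap(word, tokens_a, tokens_b):
    """Jaccard overlap of a word's neighbor sets in two texts.

    Assumption: this stands in for a consistency index; values near 1
    mean the word keeps the same neighbors in both texts.
    """
    na = neighborhoods(tokens_a).get(word, set())
    nb = neighborhoods(tokens_b).get(word, set())
    union = na | nb
    if not union:
        return 0.0  # word absent (or isolated) in both texts
    return len(na & nb) / len(union)


if __name__ == "__main__":
    text_a = "the cat sat on the mat".split()
    text_b = "the cat lay on the mat".split()
    for w in ("mat", "cat"):
        print(w, neighborhood_overlap(w, text_a, text_b))
```

In this toy pair of sentences, "mat" keeps exactly the same neighbor in both texts, whereas "cat" shares only one of its three distinct neighbors, so it scores lower under this illustrative index.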
