Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts Using the TTLab Latin Tagger

The analysis of longitudinal corpora of historical texts requires the integrated development of tools for automatically preprocessing these texts and for building representation models of their genre- and register-related dynamics. In this chapter we present such a joint endeavor, which ranges from resource formation via preprocessing to network-based text representation and classification. We begin by presenting the TTLab Latin Tagger (TLT), which preprocesses texts of classical and medieval Latin. We also briefly introduce its lexical resource, the Frankfurt Latin Lexicon (FLL). As a first test case for showing the expressiveness of these resources, we perform a tripartite classification task comprising authorship attribution, genre detection, and a combination thereof. To this end, we introduce a novel text representation model that explores the core structure (the so-called coreness) of lexical network representations of texts. Our experiment demonstrates the expressiveness of this representation format and, indirectly, of our Latin preprocessor.
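To make the notion of coreness concrete, the following minimal sketch builds a lexical co-occurrence network from a token sequence and computes each node's core number by k-core decomposition. It is an illustration under stated assumptions, not the representation model or pipeline of the chapter: the function names `cooccurrence_graph` and `core_numbers`, the `window` parameter, and the hand-written token list are all hypothetical, and in practice the tokens would be lemmas produced by a tagger. The peeling loop is a simple quadratic variant, chosen for readability over a linear-time algorithm.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking distinct tokens that co-occur within
    `window` positions of each other (window=2 links direct neighbours).
    Illustrative assumption: the chapter's network construction may differ."""
    adj = defaultdict(set)
    for i, tok in enumerate(tokens):
        adj[tok]  # make sure every token appears as a node
        for other in tokens[i + 1 : i + window]:
            if other != tok:
                adj[tok].add(other)
                adj[other].add(tok)
    return adj


def core_numbers(adj):
    """Core number of each node: the largest k such that the node belongs
    to a subgraph in which every node has degree at least k.
    Computed by repeatedly removing a node of minimum remaining degree."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])  # core numbers never decrease while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Hypothetical toy input: three tokens form a triangle, one hangs off it.
example_tokens = ["arma", "virum", "cano", "arma", "troia"]
example_core = core_numbers(cooccurrence_graph(example_tokens))
```

In this toy input, `arma`, `virum` and `cano` form a triangle and receive core number 2, while `troia`, attached by a single edge, receives core number 1. A text-level feature vector could then be derived from the distribution of such core numbers, although which features the chapter actually uses is not stated in this abstract.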
