The Royal Society Corpus: From Uncharted Data to Corpus

We present the Royal Society Corpus (RSC), built from the Philosophical Transactions and Proceedings of the Royal Society of London. At present, the corpus contains articles from the first two centuries of the journal (1665–1869) and amounts to around 35 million tokens. The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that, due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication. When building corpora from uncharted material, typically not all relevant meta-data (e.g. author, time, genre) or linguistic data (e.g. sentence/word boundaries, words, parts of speech) are readily available. We present an approach to obtaining good-quality meta-data and base text data that adopts the concept of Agile Software Development.

[1]  Paul Rayson, et al.  VARD2: a tool for dealing with spelling variation in historical corpora, 2008.

[2]  Alan W Gross, et al.  Scientific Discourse in Sociohistorical Context: The Philosophical Transactions of the Royal Society of London, 1675–1975. Dwight Atkinson, 2001.

[3]  Roman Klinger, et al.  Investigating the Relationship between Literary Genres and Emotional Plot Development, 2017, LaTeCH@ACL.

[4]  S. Piantadosi, et al.  Info/information theory: Speakers choose shorter words in predictive contexts, 2013, Cognition.

[5]  D. Atkinson.  Scientific discourse in sociohistorical context: The philosophical transactions of the Royal Society, 1998.

[6]  Peter Fankhauser, et al.  Data Mining with Shallow vs. Linguistic Features to Study Diversification of Scientific Registers, 2014, LREC.

[7]  Peter Fankhauser, et al.  The linguistic construal of disciplinarity: A data-mining approach using register features, 2016, J. Assoc. Inf. Sci. Technol.

[8]  Michael I. Jordan, et al.  Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[9]  Helmut Schmid, et al.  Probabilistic part-of-speech tagging using decision trees, 1994.

[10]  Matthew W. Crocker, et al.  Information Density and Linguistic Encoding (IDeaL), 2015, KI - Künstliche Intelligenz.

[11]  Stefan Evert, et al.  Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium, 2011.

[12]  Martin Gilje Jaatun, et al.  Agile Software Development, 2002, Comput. Sci. Educ.

[13]  Andy Way, et al.  Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 2016.

[14]  Begoña Crespo, et al.  Presenting the Coruña Corpus: a collection of samples for the historical study of English scientific writing, 2007.

[15]  Michael Halliday, et al.  On the language of physical science, 2003.

[16]  Jeannett Martin, et al.  Writing Science: Literacy And Discursive Power, 1993.

[17]  Irma Taavitsainen, et al.  Medical texts in 1500–1700 and the corpus of Early Modern English Medical Texts, 2011.

[18]  Helmut Schmid, et al.  Improvements in Part-of-Speech Tagging with an Application to German, 1999.