CCOHA: Clean Corpus of Historical American English

Modelling language change is an increasingly important area of interest within sociolinguistics and historical linguistics. In recent years, a growing number of publications have studied changes that have occurred over the past centuries. The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies of English. This paper describes methods applied to the downloadable version of COHA to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties. The resulting corpus, CCOHA, contains a larger number of cleaned word tokens, which can offer better insights into language change and allow a larger variety of tasks to be performed.
