Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

[1] Kenneth E. Shirley, et al. LDAvis: A method for visualizing and interpreting topics, 2014.

[2] Maciej Eder, et al. Mind your corpus: systematic errors in authorship attribution, 2013, Lit. Linguistic Comput.

[3] Maciej Eder, et al. Does size matter? Authorship attribution, small samples, big problem, 2015, Digit. Scholarsh. Humanit.

[4] D. Biber. Methodological Issues Regarding Corpus-based Analyses of Linguistic Variation, 1990.

[5] Daniel McNamara, et al. Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers, 2014, Digit. Humanit. Q.

[6] Stan Lipovetsky. Lexical Collocation Analysis: Advances and Applications, 2020, Technometrics.

[7] Beatrice Alex, et al. Digitised historical text: Does it have to be mediOCRe?, 2012, KONVENS.

[8] Philip M. McCarthy, et al. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment, 2010, Behavior Research Methods.

[9] Arthur Spirling, et al. Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It, 2017, Political Analysis.

[10] Paddy Bullard, et al. Digital Humanities and Electronic Resources in the Long Eighteenth Century, 2013.

[11] Trevor Hastie, et al. Regularization Paths for Generalized Linear Models via Coordinate Descent, 2010, Journal of Statistical Software.

[12] John Burrows, et al. All the Way Through: Testing for Authorship in Different Frequency Strata, 2007, Lit. Linguistic Comput.

[13] David Mimno, et al. Evaluating the Stability of Embedding-based Word Similarities, 2018, TACL.

[14] David M. Mimno, et al. Comparing Apples to Apple: The Effects of Stemmers on Topic Models, 2016, TACL.

[15] Rose Holley, et al. How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs, 2009, D-Lib Mag.

[16] Anke Lüdeling, et al. Corpus Linguistics: An International Handbook, 2009.

[17] Mark Johnson, et al. Unsupervised learning of multi-word verbs, 2001.

[18] Douglas Biber, et al. Representativeness in corpus design, 1993.

[19] P. Spedding. "The New Machine": Discovering the Limits of ECCO, 2011.

[20] Stefan Evert, et al. Collocation Candidate Extraction from Dependency-Annotated Corpora: Exploring Differences across Parsers and Dependency Annotation Schemes, 2018.

[21] Mike Kestemont, et al. Stylometry with R: A Package for Computational Text Analysis, 2016, R J.

[22] Tony McEnery, et al. Collocations in context: a new perspective on collocation networks, 2015.

[23] Klaus U. Schulz, et al. PoCoTo - an open source system for efficient interactive postcorrection of OCRed historical texts, 2014, DATeCH '14.

[24] R. Harald Baayen, et al. How Variable May a Constant be? Measures of Lexical Richness in Perspective, 1998, Comput. Humanit.

[25] Tony McEnery, et al. Collocations in Corpus-Based Language Learning Research: Identifying, Comparing, and Interpreting the Evidence, 2017.

[26] Greta Franzini, et al. Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm, 2018, Front. Digit. Humanit.

[27] Michael Piotrowski, et al. Natural Language Processing for Historical Texts, 2012, Synthesis Lectures on Human Language Technologies.

[28] Peter de Bolla. The Architecture of Concepts: The Historical Formation of Human Rights, 2013.

[29] Isabelle Boydens. Informatique, normes et temps, 1999.