@PhilosTEI: Building Corpora for Philosophers

For philosophers to take a computational turn in their field, especially where that field relies heavily on historical material, it is crucial to be able to build corpora that are high-quality, easily and freely accessible, and stored in a sustainable format, and that are composed of multi-language, multi-script books from different historical periods. At the moment, corpora matching these needs are virtually non-existent. Within the CLARIN-NL project @PhilosTEI, we have addressed the problem of building corpora of this kind by developing an open-source, web-based, user-friendly workflow from textual images to TEI. The workflow is based on Tesseract, a state-of-the-art open-source OCR engine, and on a multi-language version of TICCL, a powerful OCR post-correction tool. We have demonstrated the utility of the @PhilosTEI tool by applying it to a multilingual, multi-script corpus of important 18th- to 20th-century European philosophical texts.
