Paper information - @PhilosTEI: Building Corpora for Philosophers

@PhilosTEI: Building Corpora for Philosophers

For philosophers to take a computational turn in their field, especially where that field relies heavily on historical material, they need to be able to build high-quality, easily and freely accessible corpora in a sustainable format, composed of multi-language, multi-script books from different historical periods. At the moment, corpora that meet these needs are virtually non-existent. Within the CLARIN-NL project @PhilosTEI, we have addressed the problem of building corpora of this kind by developing an open-source, web-based, user-friendly workflow from textual images to TEI. The workflow is based on the state-of-the-art open-source OCR software Tesseract and on a multi-language version of TICCL, a powerful OCR post-correction tool. We have demonstrated the utility of the @PhilosTEI tool by applying it to a multilingual, multi-script corpus of important 18th- to 20th-century European philosophical texts.
