The primary medium of the humanities is text, but text with special characteristics that can differ considerably from a typical monolingual article in most modern scripts. Such text may be derived from manuscripts, from the retrodigitization of earlier scholarly publications such as critical editions and dictionaries, or from books printed centuries ago that follow conventions no longer in force today. The keynote identifies four major challenges for recognizing humanities data: unusual characters, unusual layouts, unusual semantics, and unusual segmentations. Each challenge is illustrated with concrete examples from a variety of times and places, including cuneiform tablets, an extract from a Greek manuscript, a page from a multilingual critical edition, a Renaissance print, and a lemma from a scholarly dictionary. In addition, scholarly humanities data is typically marked up in rich, domain-specific XML formats based on the TEI P5 guidelines. Any format that an OCR program produces must therefore be rich enough to permit a mapping to TEI-compliant markup if it is to reproduce the full richness of the original. A closer look at the TextGrid virtual research environment for the humanities and its Text-Image Link Editor (TBLE) demonstrates how scholars currently tackle these tasks. The keynote analyzes where automation can facilitate their work and enable new dimensions of research.