Paper information - Semi-Automated XML Markup of Biosystematic Legacy Literature with the GoldenGATE Editor

Digitization of legacy literature is a pressing issue today. This also applies to the domain of biosystematics, where the process has only just started. Digitized biosystematics literature requires very precise, fine-grained markup in order to be useful for detailed search, data linkage, and mining. However, manual markup at the sentence level and below is cumbersome and time-consuming. In this paper, we present and evaluate the GoldenGATE editor, which is designed for the special needs of marking up OCR output with XML. It is built to support the user in this process as far as possible: its functionality ranges from easy, intuitive tagging through markup conversion to dynamic binding of configurable plug-ins provided by third parties. Our evaluation shows that marking up an OCR document using GoldenGATE is three to four times faster than with an off-the-shelf XML editor like XML-Spy. With domain-specific NLP-based plug-ins, the speedup is even greater.

Klemens Böhm | Donat Agosti | Guido Sautter
