Enriching the legacy literature with OCR corrections and text-mined semantic metadata

The Biodiversity Heritage Library (BHL) holds the largest collection of digitised legacy literature on biodiversity. Accessible as an online, fully featured digital library, BHL stores bibliographic metadata for its digital objects, allowing users to issue keyword-based searches over the entire collection. Furthermore, through the application of optical character recognition (OCR) technology to scanned items (e.g., books, monographs, journals), textual content has been made available in machine-readable form and automatically linked to taxonomic names in the Encyclopedia of Life (EOL). Here we report on our recent efforts to advance these BHL functionalities. To rectify content, we are improving the quality of the available texts by detecting and correcting OCR-generated errors using an unsupervised statistical procedure incorporated into a desktop tool. In developing this tool, we are utilising the Google Books Ngram data sets as well as the accompanying Google Ngram Viewer. We are investigating two approaches to error correction: lexical distance-based and context-based. The former determines the best candidate unigram given only the features of the erroneous word itself. Context-based correction, in contrast, also takes into account the words surrounding it. Meanwhile, to extend the current BHL features with semantic search capabilities, we are employing text mining solutions to automatically extract semantic metadata that capture a wide range of concepts apart from taxa. To this end, natural language processing (NLP) pipelines have been constructed using Argo (http://argo.nactem.ac.uk), a Web-based, graphical text mining workbench, in order to identify other biodiversity-relevant concepts, such as expressions pertaining to people, geographic locations, habitats, morphological characteristics and time.
These pipelines, i.e., workflows, are built by combining several analytics (e.g., gazetteers and machine learning-based concept recognisers) which have been developed specifically for the biodiversity domain. The generated semantic metadata are displayed in the workbench’s graphical user interface, which allows annotations to be validated. Argo’s support for information interoperability is twofold. Firstly, its workflows can store their results in any of a number of standard encodings, e.g., XML Metadata Interchange (XMI) and Resource Description Framework (RDF) formats. Secondly, Argo includes facilities for deploying any of its workflows as RESTful (Representational State Transfer) Web services, making our NLP tools integrable with third-party applications that similarly aim to enrich free-text biodiversity resources with automatically generated semantic metadata. Finally, to facilitate exploration and understanding of the documents and metadata which will be retrieved by semantic search, appropriate information visualisations are being designed. A key requirement of this design is that users can interact with the visualisations in an analytical yet intuitive manner, enabling technical and non-technical users alike to discover and access various associations amongst BHL digital objects.
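
As a minimal sketch of the two OCR correction strategies described above, the following Python code contrasts lexical distance-based and context-based candidate selection. The toy unigram and bigram counts, the function names and the edit-distance threshold are all illustrative assumptions: they merely stand in for frequencies that could be derived from the Google Books Ngram data sets, and the code does not reproduce the statistical procedure used in the desktop tool.

```python
from collections import Counter

# Hypothetical toy counts standing in for n-gram frequencies;
# real values would come from a corpus such as Google Books Ngram.
UNIGRAMS = Counter({"the": 1000, "to": 900, "bird": 120, "bind": 80,
                    "bud": 40, "nest": 60, "books": 50})
BIGRAMS = Counter({("the", "bird"): 30, ("bird", "nest"): 12,
                   ("the", "bud"): 5, ("to", "bind"): 20,
                   ("bind", "books"): 9})


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def candidates(word, max_distance=2):
    """Known unigrams within max_distance edits of the given word."""
    return [w for w in UNIGRAMS if edit_distance(word, w) <= max_distance]


def correct_lexical(word):
    """Lexical distance-based: use only the erroneous word itself.

    Prefers the closest candidate, breaking ties by unigram frequency.
    """
    if word in UNIGRAMS:
        return word
    cands = candidates(word)
    if not cands:
        return word  # no plausible correction; leave the token unchanged
    return min(cands, key=lambda w: (edit_distance(word, w), -UNIGRAMS[w]))


def correct_in_context(prev_word, word, next_word):
    """Context-based: rank candidates by bigram counts with the neighbours."""
    if word in UNIGRAMS:
        return word
    cands = candidates(word)
    if not cands:
        return word

    def score(w):
        return BIGRAMS[(prev_word, w)] + BIGRAMS[(w, next_word)]

    return max(cands, key=lambda w: (score(w),
                                     -edit_distance(word, w),
                                     UNIGRAMS[w]))


print(correct_lexical("bitd"))                      # bird
print(correct_in_context("to", "bitd", "books"))    # bind
```

With these toy counts, the erroneous token "bitd" is one edit away from both "bird" and "bind". The lexical method picks "bird" because it is the more frequent unigram, whereas the context-based method picks "bind" when the neighbouring words are "to" and "books".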