Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction

Curated databases such as the Encyclopedia of Life and NCBI Taxonomy are among the most fundamental sources of information for understanding biodiversity. Another rich, albeit less exploited, resource is the biodiversity literature, which possibly provides even more comprehensive information, since any significant findings have most likely been published in one form or another: in reports, articles, books or monographs. However, unlike curated databases, which provide information in a structured, readily computable form, literature collections consist of copious textual data expressed in natural language. The unstructured and voluminous nature of the literature makes it difficult to find information of interest, posing a barrier to knowledge accessibility and discovery. The Biodiversity Heritage Library (BHL) is home to most of the world’s biodiversity legacy literature. To allow its users to find information in a more focussed and efficient manner, efforts towards the development of a semantically enabled search engine are currently underway. To this end, semantic metadata in the form of concept annotations has been automatically extracted from the BHL collection using text mining (TM) techniques. This was carried out in four stages: (1) producing a moderately sized BHL corpus in which concepts have been manually marked up and assigned semantic labels, e.g., taxon, location, anatomical entity, habitat; (2) training machine learning-based concept recognition models on this corpus; (3) applying the trained models to BHL documents in order to automatically recognise and assign semantic labels to concepts; and (4) automatically linking semantically related concepts using distributional similarity methods. BHL documents were then indexed according to the semantic annotations generated by this TM methodology.
This indexing facilitates the incorporation of the following features into BHL’s search engine: (1) query expansion, which helps users widen their search through automatic suggestion of synonyms; and (2) semantic facets, which users can specify to narrow down search results and filter out documents pertaining to unwanted word senses.
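As a purely illustrative sketch of how these two features could operate over a semantically annotated index, consider the toy example below. It is not the BHL implementation: the documents, the annotations, the `RELATED` table (standing in for links that would in practice come from distributional similarity) and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: each document carries (surface form, semantic label)
# annotations of the kind a concept recognition model might produce.
DOCS = {
    "doc1": [("Quercus robur", "taxon"), ("woodland", "habitat")],
    "doc2": [("English oak", "taxon"), ("Paris", "location")],
    "doc3": [("Paris", "taxon"), ("woodland", "habitat")],
}

# Hypothetical links between semantically related concepts. Here they are
# written by hand; in the described pipeline they would be derived automatically.
RELATED = {
    "quercus robur": ["english oak"],
    "english oak": ["quercus robur"],
}


def build_index(docs):
    """Map each (lower-cased term, semantic label) pair to the documents containing it."""
    index = defaultdict(set)
    for doc_id, annotations in docs.items():
        for term, label in annotations:
            index[(term.lower(), label)].add(doc_id)
    return index


def expand_query(term, related=RELATED):
    """Query expansion: return the term followed by its suggested related terms."""
    key = term.lower()
    return [key] + [t for t in related.get(key, []) if t != key]


def search(index, term, facet=None, expand=True):
    """Return sorted ids of documents matching the (optionally expanded) term.

    If `facet` is given, only annotations carrying that semantic label count,
    which filters out documents that use the term in an unwanted sense.
    """
    terms = expand_query(term) if expand else [term.lower()]
    hits = set()
    for (indexed_term, label), doc_ids in index.items():
        if indexed_term in terms and (facet is None or label == facet):
            hits |= doc_ids
    return sorted(hits)


INDEX = build_index(DOCS)
print(search(INDEX, "Quercus robur"))         # ['doc1', 'doc2']
print(search(INDEX, "Paris"))                 # ['doc2', 'doc3']
print(search(INDEX, "Paris", facet="taxon"))  # ['doc3']
```

In this sketch, expanding "Quercus robur" also retrieves the document that mentions only the vernacular name, while the `taxon` facet restricts "Paris" to the document annotated with the plant genus rather than the city.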