Mining Information Extraction Models for HmtDB annotation

Advances of genome sequencing techniques have risen an overwhelming increase in the literature on discovered genes, proteins and their role in biological processes. However, the biomedical literature remains a greatly unexploited source of biological information. Information extraction (IE) techniques are necessary to map this information into structured representations that allow facts relating domain-relevant entities to be automatically recognized. In this paper, we present a framework that supports biologists in the task of automatic extraction of information from texts. The framework integrates a data mining module that discovers extraction rules from a set of manually labelled texts. Extraction models are subsequently applied in an automatic mode on unseen texts. We report an application to a real-world dataset composed by publications selected to support biologists in the annotation of the HmtDB database