The GENIA Project: Knowledge Acquisition from Biology Texts

Overview of Project

The GENIA project [9] (Fig. 1) seeks to automatically extract useful information from texts written by scientists, to help overcome the problems caused by information overload. Although the methods are customized for the microbiology domain, we intend the basic methods to be generalisable to knowledge acquisition in other scientific and engineering domains.

Extracting and classifying molecular biology terminology is a significant challenge. We aim to work with domain experts to build tools that identify the terminology for objects such as proteins and genes and that discover the relations between them. So far we have achieved considerable success using hidden Markov models [1] and decision trees [5] that learn from a marked-up corpus [6, 8]. The results can be used to build gazetteers, formulate nomenclatures and ontologies, index documents for searching, and add to medical databases. We have also developed unsupervised methods for extracting a thesaurus from large domain text collections [3] and a supervised method based on boosting for classifying MEDLINE abstracts [2].

In addition to processing methods, we are developing an annotated corpus in which the structure of a text, the structure of its sentences, and the semantics of terms based on a domain ontology [7] are marked up by experts. These annotations serve as a knowledge resource on which the learning models and other methodologies can be developed. We have also developed an annotation tool that manages multiple tag sets on a text [4] and GPML (Genia Project Markup Language) [10], together with definitions of the various kinds of information to be annotated on the abstracts. The annotation tool incorporates the terminology extraction process we have developed, and helps annotators by showing them candidate terms.

We are currently working on the key task of extracting event information about protein interactions. This type of information extraction requires combining many sources of knowledge, which we are now developing. These include a parser, an ontology, a thesaurus and domain dictionaries, as well as supervised learning models.