Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach

OBJECTIVE The number of new discoveries published in the biomedical scientific literature is growing at an exponential rate. This growth makes it very difficult to filter the most relevant results, so extracting the core information becomes very expensive. There is therefore growing interest in text processing approaches that can deliver selected information from scientific publications and thereby limit the human intervention normally needed to gather those results.

MATERIALS AND METHODS This paper presents and evaluates an approach for automating the extraction of functional relations (e.g. interactions between genes and proteins) from scientific literature in the biomedical domain. The approach uses a novel dependency-based parser and is based on a complete syntactic analysis of the corpus.

RESULTS We have implemented a state-of-the-art text mining system for biomedical literature, based on a deep-linguistic, full-parsing approach. The results are validated on two different corpora: the manually annotated Genomics Information Access (GENIA) corpus and the automatically annotated Arabidopsis thaliana Circadian Rhythms (ATCR) corpus.

CONCLUSION We show that a deep-linguistic approach, contrary to common belief, can be used in a real-world text mining application, offering high-precision relation extraction while retaining sufficient recall.
