Relation Mining over a Corpus of Scientific Literature

The number of new discoveries in Molecular Biology, as published in the scientific literature, is currently growing at an exponential rate. This growth makes it very difficult to filter out the most relevant results, and extracting their core information for inclusion in one of the knowledge resources maintained by the research community becomes very expensive. There is therefore growing interest in text processing approaches that can deliver selected information from scientific publications and so limit the amount of human intervention normally needed to gather those results. This paper presents and evaluates an approach aimed at automating the extraction of semantic relations (e.g. interactions between genes and proteins) from scientific literature in the domain of Molecular Biology. The approach is based on a complete syntactic analysis of the corpus, performed with a novel dependency-based parser.
