Extracting Regulatory Gene Expression Networks From Pubmed

We present an approach using syntacto-semantic rules for the extraction of relational information from biomedical abstracts. The results show that by overcoming the hurdle of technical terminology, high precision results can be achieved. From abstracts related to baker's yeast, we manage to extract a regulatory network comprised of 441 pairwise relations from 58,664 abstracts with an accuracy of 83-90%. To achieve this, we made use of a resource of gene/protein names considerably larger than those used in most other biology related information extraction approaches. This list of names was included in the lexicon of our retrained part-of-speech tagger for use on molecular biology abstracts. For the domain in question an accuracy of 93.6-97.7% was attained on POS-tags. The method is easily adapted to other organisms than yeast, allowing us to extract many more biologically relevant relations.

[1]  Miguel A. Andrade-Navarro,et al.  Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions , 1999, ISMB.

[2]  Helmut Schmidt,et al.  Probabilistic part-of-speech tagging using decision trees , 1994 .

[3]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[4]  Steven P. Abney Partial parsing via finite-state cascades , 1996, Natural Language Engineering.

[5]  Peer Bork,et al.  The way we write , 2003, EMBO Reports.

[6]  Goran Nenadic,et al.  Selecting Text Features for Gene Name Classification: from Documents to Terms , 2003, BioNLP@ACL.

[7]  Kara Dolinski,et al.  Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO) , 2002, Nucleic Acids Res..

[8]  Yuji Matsumoto,et al.  Protein Name Tagging for Biomedical Annotation in Text , 2003, BioNLP@ACL.

[9]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[10]  Michael Krauthammer,et al.  GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles , 2001, ISMB.

[11]  James Pustejovsky,et al.  Robust Relational Parsing Over Biomedical Literature: Extracting Inhibit Relations , 2001, Pacific Symposium on Biocomputing.

[12]  Beatrice Santorini,et al.  Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision) , 1990 .

[13]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[14]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology Information , 2021, Nucleic Acids Res..

[15]  Ioannis Xenarios,et al.  Mining literature for protein-protein interactions , 2001, Bioinform..

[16]  C. Ouzounis,et al.  Automatic extraction of protein interactions from scientific abstracts. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[17]  Jerry R. Hobbs Information extraction from biomedical text , 2002, J. Biomed. Informatics.

[18]  Helmut Schmid Unsupervised Learning of Period Disambiguation for Tokenisation , 2000 .

[19]  Beatrice Santorini Part-of-speech tagging guidelines for the penn treebank project , 1990 .

[20]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology Information: update , 2004, Nucleic acids research.

[21]  Pasi Tapanainen,et al.  What is a word, What is a sentence? Problems of Tokenization , 1994 .

[22]  C. Ball,et al.  Saccharomyces Genome Database. , 2002, Methods in enzymology.