BioPubMiner: Machine Learning Component-Based Biomedical Information Analysis Platform

In this paper we introduce BioPubMiner, a machine learning component-based platform for biomedical information analysis. BioPubMiner employs natural language processing techniques and machine learning based data mining techniques for mining useful biological information such as protein-protein interaction from the massive literature. The system recognizes biological terms such as gene, protein, and enzymes and extracts their interactions described in the document through natural language processing. The extracted interactions are further analyzed with a set of features of each entity that were collected from the related public database to infer more interactions from the original interactions. The performance of entity and interaction extraction was tested with selected MEDLINE abstracts. The evaluation of inference proceeded using the protein interaction data of S.cerevisiae (bakers yeast) from MIPS and SGD.

[1]  Kara Dolinski,et al.  Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms , 2004, Nucleic Acids Res..

[2]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[3]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[4]  Miguel A. Andrade-Navarro,et al.  Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families , 1998, Bioinform..

[5]  R. Ozawa,et al.  A comprehensive two-hybrid analysis to explore the yeast protein interactome , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Michael Krauthammer,et al.  GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles , 2001, ISMB.

[7]  William H. Press,et al.  Numerical recipes in C , 2002 .

[8]  Anton Yuryev,et al.  Extracting human protein interactions from MEDLINE using a full-sentence parser , 2004, Bioinform..

[9]  Hae-Chang Rim,et al.  Two-Phase Biomedical NE Recognition based on SVMs , 2003, BioNLP@ACL.

[10]  Miguel A. Andrade-Navarro,et al.  Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions , 1999, ISMB.

[11]  Jung-Hsien Chiang,et al.  GIS: a biomedical text-mining system for gene information discovery , 2004, Bioinform..

[12]  D. Lindberg,et al.  The Unified Medical Language System , 1993, Methods of Information in Medicine.

[13]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[14]  L Hunter,et al.  MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. , 1999, BioTechniques.

[15]  Kenji Satou,et al.  Extraction of knowledge on protein-protein interaction by association rule discovery , 2002, Bioinform..

[16]  Tsviya Olender,et al.  Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE , 2003, Nucleic Acids Res..

[17]  Naftali Tishby,et al.  Document clustering using word clusters via the information bottleneck method , 2000, SIGIR '00.

[18]  P Bork,et al.  XplorMed: a tool for exploring MEDLINE abstracts. , 2001, Trends in biochemical sciences.

[19]  James R. Knight,et al.  A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae , 2000, Nature.

[20]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[21]  Huan Liu,et al.  Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution , 2003, ICML.

[22]  Hae-Chang Rim,et al.  Weighted Probabilistic Sum Model Based on Decision Tree Decomposition for Text Chunking , 2003, Int. J. Comput. Process. Orient. Lang..

[23]  Dmitrij Frishman,et al.  MIPS: analysis and annotation of proteins from whole genomes in 2005 , 2006, Nucleic Acids Res..

[24]  P Bork,et al.  Automated extraction of information in molecular biology , 2000, FEBS letters.