Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome

BackgroundExtensive protein interaction maps are being constructed for yeast, worm, and fly to ask how the proteins organize into pathways and systems, but no such genome-wide interaction map yet exists for the set of human proteins. To prepare for studies in humans, we wished to establish tests for the accuracy of future interaction assays and to consolidate the known interactions among human proteins.ResultsWe established two tests of the accuracy of human protein interaction datasets and measured the relative accuracy of the available data. We then developed and applied natural language processing and literature-mining algorithms to recover from Medline abstracts 6,580 interactions among 3,737 human proteins. A three-part algorithm was used: first, human protein names were identified in Medline abstracts using a discriminator based on conditional random fields, then interactions were identified by the co-occurrence of protein names across the set of Medline abstracts, filtering the interactions with a Bayesian classifier to enrich for legitimate physical interactions. These mined interactions were combined with existing interaction data to obtain a network of 31,609 interactions among 7,748 human proteins, accurate to the same degree as the existing datasets.ConclusionThese interactions and the accuracy benchmarks will aid interpretation of current functional genomics data and provide a basis for determining the quality of future large-scale human protein interaction assays. Projecting from the approximately 15 interactions per protein in the best-sampled interaction set to the estimated 25,000 human genes implies more than 375,000 interactions in the complete human protein interaction network. This set therefore represents no more than 10% of the complete network.

[1]  Eric Brill,et al.  Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging , 1995, CL.

[2]  T. Takagi,et al.  Toward information extraction: identifying protein names from biological papers. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[3]  B J Stapley,et al.  Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[4]  James R. Knight,et al.  A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae , 2000, Nature.

[5]  D. Eisenberg,et al.  Protein function in the post-genomic era , 2000, Nature.

[6]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[7]  Ian M. Donaldson,et al.  BIND: the Biomolecular Interaction Network Database , 2001, Nucleic Acids Res..

[8]  T. Jenssen,et al.  A literature network of human genes for high-throughput analysis of gene expression , 2001, Nature Genetics.

[9]  Gary D Bader,et al.  Systematic Genetic Analysis with Ordered Arrays of Yeast Deletion Mutants , 2001, Science.

[10]  R. Ozawa,et al.  A comprehensive two-hybrid analysis to explore the yeast protein interactome , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[11]  H. Herzel,et al.  Is there a bias in proteome research? , 2001, Genome research.

[12]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[13]  Ioannis Xenarios,et al.  Mining literature for protein-protein interactions , 2001, Bioinform..

[14]  Gary D Bader,et al.  Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry , 2002, Nature.

[15]  Ioannis Xenarios,et al.  DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions , 2002, Nucleic Acids Res..

[16]  P. Bork,et al.  Functional organization of the yeast proteome by systematic analysis of protein complexes , 2002, Nature.

[17]  Fredrik Olsson,et al.  Protein names and how to find them , 2002, Int. J. Medical Informatics.

[18]  B. Snel,et al.  Comparative assessment of large-scale data sets of protein–protein interactions , 2002, Nature.

[19]  Limsoon Wong,et al.  Accomplishments and challenges in literature data mining for biology , 2002, Bioinform..

[20]  Charles DeLisi,et al.  Predictome: a database of putative functional links between proteins , 2002, Nucleic Acids Res..

[21]  Lorraine K. Tanabe,et al.  Tagging gene and protein names in biomedical text , 2002, Bioinform..

[22]  B. Snel,et al.  Function prediction and protein networks. , 2003, Current opinion in cell biology.

[23]  Huiqing Liu,et al.  Data Mining Tools for Biological Sequences , 2003, J. Bioinform. Comput. Biol..

[24]  James R. Knight,et al.  A Protein Interaction Map of Drosophila melanogaster , 2003, Science.

[25]  M. Huynen,et al.  Prediction of protein function and pathways in the genome era , 2004, Cellular and Molecular Life Sciences CMLS.

[26]  M. Gerstein,et al.  A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data , 2003, Science.

[27]  A. Fraser,et al.  A first-draft human protein-interaction map , 2004, Genome Biology.

[28]  G. Sumara,et al.  A Probabilistic Functional Network of Yeast Genes , 2004 .

[29]  Michael Krauthammer,et al.  GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data , 2004, J. Biomed. Informatics.

[30]  Giulio Superti-Furga,et al.  A physical and functional map of the human TNF-α/NF-κB signal transduction pathway , 2004, Nature Cell Biology.

[31]  G. Casari,et al.  A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. , 2004, Nature cell biology.

[32]  Gary D Bader,et al.  Global Mapping of the Yeast Genetic Interaction Network , 2004, Science.

[33]  Susumu Goto,et al.  The KEGG resource for deciphering the genome , 2004, Nucleic Acids Res..

[34]  E. Lander,et al.  Finishing the euchromatic sequence of the human genome , 2004 .

[35]  Edward M Marcotte,et al.  LGL: creating a map of protein function with an algorithm for visualizing very large biological networks. , 2004, Journal of molecular biology.

[36]  Dipanwita Roy Chowdhury,et al.  Human protein reference database as a discovery resource for proteomics , 2004, Nucleic Acids Res..

[37]  S. L. Wong,et al.  A Map of the Interactome Network of the Metazoan C. elegans , 2004, Science.

[38]  J. Bonfield,et al.  Finishing the euchromatic sequence of the human genome , 2004, Nature.

[39]  A. Barabasi,et al.  Network biology: understanding the cell's functional organization , 2004, Nature Reviews Genetics.

[40]  International Human Genome Sequencing Consortium Finishing the euchromatic sequence of the human genome , 2004 .

[41]  J. Wojcik,et al.  Functional proteomics mapping of a human signaling pathway. , 2004, Genome research.

[42]  Rohit J. Kate,et al.  Comparative experiments on learning information extractors for proteins and their interactions , 2005, Artif. Intell. Medicine.

[43]  Lincoln Stein,et al.  Reactome: a knowledgebase of biological pathways , 2004, Nucleic Acids Res..