Assessing semantic similarity measures for the characterization of human regulatory pathways

MOTIVATION Pathway modeling requires the integration of multiple data including prior knowledge. In this study, we quantitatively assess the application of Gene Ontology (GO)-derived similarity measures for the characterization of direct and indirect interactions within human regulatory pathways. The characterization would help the integration of prior pathway knowledge for the modeling. RESULTS Our analysis indicates information content-based measures outperform graph structure-based measures for stratifying protein interactions. Measures in terms of GO biological process and molecular function annotations can be used alone or together for the validation of protein interactions involved in the pathways. However, GO cellular component-derived measures may not have the ability to separate true positives from noise. Furthermore, we demonstrate that the functional similarity of proteins within known regulatory pathways decays rapidly as the path length between two proteins increases. Several logistic regression models are built to estimate the confidence of both direct and indirect interactions within a pathway, which may be used to score putative pathways inferred from a scaffold of molecular interactions.

[1]  Shmuel Sattath,et al.  How reliable are experimental protein-protein interaction data? , 2003, Journal of molecular biology.

[2]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[3]  Susumu Goto,et al.  The KEGG resource for deciphering the genome , 2004, Nucleic Acids Res..

[4]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[5]  R. Gentleman,et al.  Visualizing and Distances Using GO , 2006 .

[6]  Zhilei Chen,et al.  A highly sensitive selection method for directed evolution of homing endonucleases , 2005, Nucleic acids research.

[7]  James R. Knight,et al.  A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae , 2000, Nature.

[8]  Gary D Bader,et al.  BIND--The Biomolecular Interaction Network Database. , 2001, Nucleic acids research.

[9]  C. Deane,et al.  Protein Interactions , 2002, Molecular & Cellular Proteomics.

[10]  Dong Xu,et al.  Computational analyses of high-throughput protein-protein interaction data. , 2003, Current protein & peptide science.

[11]  Thomas Lengauer,et al.  ROCR: visualizing classifier performance in R , 2005, Bioinform..

[12]  Alfonso Valencia,et al.  Computational methods for the prediction of protein interaction partners , 2004 .

[13]  Carole A. Goble,et al.  Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation , 2003, Bioinform..

[14]  Carole A. Goble,et al.  Semantic Similarity Measures as Tools for Exploring the Gene Ontology , 2002, Pacific Symposium on Biocomputing.

[15]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[16]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Data Mining Researchers , 2003 .

[17]  Dong Xu,et al.  Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. , 2004, Nucleic acids research.

[18]  G. Drewes,et al.  Global approaches to protein-protein interactions. , 2003, Current opinion in cell biology.

[19]  Igor Jurisica,et al.  Online Predicted Human Interaction Database , 2005, Bioinform..

[20]  Trey Ideker,et al.  Building with a scaffold: emerging strategies for high- to low-level cellular modeling. , 2003, Trends in biotechnology.

[21]  M. Gerstein,et al.  A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data , 2003, Science.

[22]  J. Rothberg,et al.  Gaining confidence in high-throughput protein interaction networks , 2004, Nature Biotechnology.

[23]  A. Fraser,et al.  A first-draft human protein-interaction map , 2004, Genome Biology.

[24]  M. Vidal,et al.  Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". , 2001, Genome research.

[25]  Mark Gerstein,et al.  Information assessment on predicting protein-protein interactions , 2004, BMC Bioinformatics.

[26]  A. Valencia,et al.  Computational methods for the prediction of protein interactions. , 2002, Current opinion in structural biology.

[27]  R. Karp,et al.  From the Cover : Conserved patterns of protein interaction in multiple species , 2005 .

[28]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[29]  M. Gerstein,et al.  Assessing the limits of genomic data integration for predicting protein networks. , 2005, Genome research.