Finding disease similarity based on implicit semantic similarity

Genomics has contributed to a growing collection of gene-function and gene-disease annotations that can be exploited by informatics to study similarity between diseases. This can yield insight into disease etiology, reveal common pathophysiology and/or suggest treatment that can be appropriated from one disease to another. Estimating disease similarity solely on the basis of shared genes can be misleading as variable combinations of genes may be associated with similar diseases, especially for complex diseases. This deficiency can be potentially overcome by looking for common biological processes rather than only explicit gene matches between diseases. The use of semantic similarity between biological processes to estimate disease similarity could enhance the identification and characterization of disease similarity. We present functions to measure similarity between terms in an ontology, and between entities annotated with terms drawn from the ontology, based on both co-occurrence and information content. The similarity measure is shown to outperform other measures used to detect similarity. A manually curated dataset with known disease similarities was used as a benchmark to compare the estimation of disease similarity based on gene-based and Gene Ontology (GO) process-based comparisons. The detection of disease similarity based on semantic similarity between GO Processes (Recall=55%, Precision=60%) performed better than using exact matches between GO Processes (Recall=29%, Precision=58%) or gene overlap (Recall=88% and Precision=16%). The GO-Process based disease similarity scores on an external test set show statistically significant Pearson correlation (0.73) with numeric scores provided by medical residents. GO-Processes associated with similar diseases were found to be significantly regulated in gene expression microarray datasets of related diseases.

[1]  Min Li,et al.  High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge , 2010, J. Am. Medical Informatics Assoc..

[2]  C. Sotiriou,et al.  Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures , 2007, Breast Cancer Research.

[3]  Ted Pedersen,et al.  Measures of semantic similarity and relatedness in the biomedical domain , 2007, J. Biomed. Informatics.

[4]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[5]  Carol Friedman,et al.  Extracting Phenotypic Information from the Literature via Natural Language Processing , 2004, MedInfo.

[6]  Albert-László Barabási,et al.  A Dynamic Network Approach for the Study of Human Phenotypes , 2009, PLoS Comput. Biol..

[7]  A. Barabasi,et al.  Molecular Systems Biology 5; Article number 262; doi:10.1038/msb.2009.16 Citation: Molecular Systems Biology 5:262 , 2022 .

[8]  Joel Dudley,et al.  Network-Based Elucidation of Human Disease Similarities Reveals Common Functional Modules Enriched for Pluripotent Drug Targets , 2010, PLoS Comput. Biol..

[9]  A. Butte,et al.  Creation and implications of a phenome-genome network , 2006, Nature Biotechnology.

[10]  K. Struhl,et al.  A transcriptional signature and common gene networks link cancer with lipid metabolism and diverse human diseases. , 2010, Cancer cell.

[11]  Deendayal Dinakarpandian,et al.  A New Metric to Measure Gene Product Similarity , 2007, BIBM.

[12]  Ted Pedersen,et al.  UMLS-Interface and UMLS-Similarity : Open Source Software for Measuring Paths and Semantic Similarity , 2009, AMIA.

[13]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[14]  M. Doran,et al.  Rheumatoid arthritis and diabetes mellitus: evidence for an association? , 2007, The Journal of rheumatology.

[15]  Alex Zelikovsky,et al.  Risk Factor Searching Heuristics for SNP Case-Control Studies , 2007, 2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007).

[16]  Bethan Hughes,et al.  2009 FDA drug approvals , 2010, Nature Reviews Drug Discovery.

[17]  Philip S. Yu,et al.  A new method to measure the semantic similarity of GO terms , 2007, Bioinform..

[18]  W. Kibbe,et al.  Annotating the human genome with Disease Ontology , 2009, BMC Genomics.

[19]  Terrence Adam,et al.  Semantic Similarity and Relatedness between Clinical Terms: An Experimental Study. , 2010, AMIA ... Annual Symposium proceedings. AMIA Symposium.

[20]  Pankaj Agarwal,et al.  A Pathway-Based View of Human Diseases and Disease Relationships , 2009, PloS one.

[21]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[22]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[23]  Ted Pedersen,et al.  Extended Gloss Overlaps as a Measure of Semantic Relatedness , 2003, IJCAI.

[24]  Phillip W. Lord,et al.  Semantic Similarity in Biomedical Ontologies , 2009, PLoS Comput. Biol..

[25]  Matthew A. Hibbs,et al.  Finding function: evaluation methods for functional genomic data , 2006, BMC Genomics.

[26]  A. Barabasi,et al.  The human disease network , 2007, Proceedings of the National Academy of Sciences.

[27]  John D. Osborne,et al.  Annotating the human genome with Disease , 2009 .

[28]  Eytan Domany,et al.  Outcome signature genes in breast cancer: is there a unique set? , 2004, Breast Cancer Research.

[29]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[30]  Betsy L. Humphreys,et al.  Technical Milestone: The Unified Medical Language System: An Informatics Research Collaboration , 1998, J. Am. Medical Informatics Assoc..

[31]  Brad T. Sherman,et al.  DAVID: Database for Annotation, Visualization, and Integrated Discovery , 2003, Genome Biology.

[32]  Christiane Fellbaum,et al.  Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[33]  T. Speed,et al.  Summaries of Affymetrix GeneChip probe level data. , 2003, Nucleic acids research.

[34]  Gerald J. Lieberman,et al.  On the Consecutive-k-of-n:F System , 1982, IEEE Transactions on Reliability.

[35]  Dennis B. Troup,et al.  NCBI GEO: mining millions of expression profiles—database and tools , 2004, Nucleic Acids Res..

[36]  S. Horvath,et al.  Variations in DNA elucidate molecular networks that cause disease , 2008, Nature.

[37]  Susumu Goto,et al.  KEGG: Kyoto Encyclopedia of Genes and Genomes , 2000, Nucleic Acids Res..

[38]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[39]  Yibo Wu,et al.  GOSemSim: an R package for measuring semantic similarity among GO terms and gene products , 2010, Bioinform..

[40]  A. Rzhetsky,et al.  Probing genetic overlap among complex human phenotypes , 2007, Proceedings of the National Academy of Sciences.

[41]  Jing Zhu,et al.  Revealing and avoiding bias in semantic similarity scores for protein pairs , 2010, BMC Bioinformatics.

[42]  Ted Pedersen,et al.  Using Measures of Semantic Relatedness for Word Sense Disambiguation , 2003, CICLing.

[43]  Ted Pedersen,et al.  Using semantic relatedness for word sense disambiguation , 2002 .

[44]  Jagdish Chandra Patra,et al.  Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network , 2010, Bioinform..

[45]  Hai Hu,et al.  Assessing semantic similarity measures for the characterization of human regulatory pathways , 2006, Bioinform..

[46]  Deendayal Dinakarpandian,et al.  Automated Ontological Gene Annotation for Computing Disease Similarity , 2010, Summit on translational bioinformatics.