Bayesian approach to transforming public gene expression repositories into disease diagnosis databases

The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus is currently the largest database that systematically documents the genome-wide molecular basis of diseases. However, this resource has so far been far from fully utilized. This paper describes the first study to transform public gene expression repositories into an automated disease diagnosis database. Specifically, we have developed a systematic framework, including a two-stage Bayesian learning approach, that diagnoses one or more diseases for a query expression profile along a hierarchical disease taxonomy. By standardizing cross-platform gene expression data and heterogeneous disease annotations, our approach allows both sources of information to be analyzed in a unified probabilistic system. Cross-validation showed a high level of overall diagnostic accuracy. We also demonstrated that the power of our method can increase significantly with the continued growth of public gene expression repositories. Finally, we showed how our disease diagnosis system can be used to characterize complex phenotypes and to construct a disease-drug connectivity map.
