Alignment of gene expression profiles from test samples against a reference database: New method for context-specific interpretation of microarray data

Background: Gene expression microarray data have been organized and made available as public databases, but the use of such highly heterogeneous reference datasets to interpret data from individual test samples is less developed than, for example, in the field of nucleotide sequence comparisons. We have created a rapid and powerful approach for the alignment of microarray gene expression profiles (AGEP) from test samples with those contained in a large annotated public reference database, and we demonstrate here how this can facilitate the interpretation of microarray data from individual samples.

Methods: AGEP is based on the calculation of kernel density distributions for the expression level of each gene in each reference tissue type. It quantifies the similarity between the test sample and each reference tissue type and identifies the typical and atypical genes in each comparison. As a reference database, we used 1654 samples from 44 normal tissues (extracted from the Genesapiens database).

Results: Using leave-one-out validation, AGEP correctly defined the tissue of origin for 1521 (93.6%) of all the 1654 samples in the original database. Independent validation with 195 external normal tissue samples resulted in 87% accuracy for the exact tissue type and 97% accuracy when related tissue types were included. AGEP analysis of 10 Duchenne muscular dystrophy (DMD) samples provided a quantitative description of the key pathogenetic events, such as the extent of inflammation, in individual samples and pinpointed tissue-specific genes, such as SAMD4A, whose expression changed in DMD. AGEP analysis of microarray data from adipocytic differentiation of mesenchymal stem cells and from normal myeloid cell types and leukemias provided a quantitative characterization of the transcriptomic changes during normal and abnormal cell differentiation.

Conclusions: AGEP is a widely applicable method for the rapid and comprehensive interpretation of microarray data, as shown here by the definition of tissue- and disease-specific changes in gene expression as well as of changes during cellular differentiation. The capability to quantitatively compare data from individual samples against a large-scale annotated reference database represents a widely applicable paradigm for the analysis of all types of high-throughput data. AGEP enables systematic and quantitative comparison of gene expression data from test samples against a comprehensive collection of different cell and tissue types previously studied by the entire research community.
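The kernel density idea described in the Methods can be illustrated with a minimal sketch. This is not the published AGEP implementation: the Gaussian kernel, the rule-of-thumb bandwidth, the peak-normalized per-gene score, the averaging into a tissue similarity, the `atypical_below` threshold, and all names and toy expression values are assumptions chosen only to show the general shape of such a comparison.

```python
import math
from statistics import mean, stdev


def gaussian_kde(values, bandwidth=None):
    """Return a density function built from Gaussian kernels on `values`."""
    n = len(values)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (assumption); fall back to 1.0 if no spread.
        sd = stdev(values) if n > 1 else 0.0
        bandwidth = 1.06 * sd * n ** (-1 / 5) if sd > 0 else 1.0
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def gene_match_score(x, values):
    """Score in [0, 1]: density at x relative to the (approximate) peak density."""
    density = gaussian_kde(values)
    # The peak is approximated by the highest density at any reference value.
    peak = max(density(v) for v in values)
    return min(1.0, density(x) / peak)


def align_sample(sample, reference, atypical_below=0.1):
    """Compare one test sample against every reference tissue type."""
    results = {}
    for tissue, genes in reference.items():
        scores = {
            gene: gene_match_score(sample[gene], values)
            for gene, values in genes.items()
            if gene in sample
        }
        results[tissue] = {
            # Hypothetical summary: mean of the per-gene scores.
            "similarity": mean(scores.values()),
            # Genes whose test value is unlikely in this tissue.
            "atypical": sorted(g for g, s in scores.items() if s < atypical_below),
        }
    return results


# Toy reference data (invented values, hypothetical gene names).
reference = {
    "muscle": {
        "GENE_A": [9.8, 10.1, 10.4, 9.9, 10.2],
        "GENE_B": [2.0, 2.3, 1.9, 2.1, 2.2],
    },
    "liver": {
        "GENE_A": [3.0, 3.2, 2.9, 3.1, 3.3],
        "GENE_B": [8.0, 8.2, 7.9, 8.1, 8.3],
    },
}
sample = {"GENE_A": 10.0, "GENE_B": 5.0}

results = align_sample(sample, reference)
best = max(results, key=lambda tissue: results[tissue]["similarity"])
print(best, results[best]["atypical"])
```

In this toy case the sample resembles the muscle reference for one gene while the other gene is flagged as atypical, which mirrors the kind of output the abstract describes: a per-tissue similarity plus the genes that deviate from the best-matching tissue.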
