Targeted analyses of very large genome-wide data collections

Genome-scale experiments provide an overwhelming amount of molecular information for biologists. New computational methods are needed for specific analysis and interpretation of such high-dimensional data. Here we take advantage of massive public repositories to quantify the tissue-specific signals in gene expression profiles, characterize distinctive molecular features of human diseases, deconvolve the latent cell-type-specific factors in mixed clinical samples, and automatically integrate heterogeneous data sources in the context of a specific genome-wide dataset. First, we describe URSA (Unveiling RNA Sample Annotation), which incorporates known tissue and cell-type relationships to better estimate the specific signal in any given gene expression profile. Our ontology-aware method combines independent discriminative classifiers in a Bayesian framework, outperforming other machine learning methods. We provide a molecular interpretation of the tissue and cell-type models learned by URSA, enabling a data-driven view of the molecular processes specific to particular tissues and cell types. We then extend this work to human diseases. We use thousands of clinical disease-specific expression profiles in public repositories to quantify distinctive functional and anatomical characteristics of human diseases. Through our data-driven analysis, we explore the complexity of the human disease landscape and propose exploratory hypotheses for drug repurposing, even for rare diseases with no prior genetic knowledge. Lastly, we describe YETI (Your Evidence Tailored Integration) for targeted integration of heterogeneous genome-wide data sources. Biomedical researchers generate genome-wide datasets for data-driven exploration of specific questions, but such analyses are disconnected from big public data collections. YETI is the first automatic integration method that effectively constructs functional networks specific to a genome-scale dataset. 
We show that the resulting integration reflects the biological context of the user-provided dataset while providing accurate predictions of functional interactions.
