The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins

Genomics, epigenomics, transcriptomics, proteomics and metabolomics efforts rapidly generate a plethora of data on the activity and levels of biomolecules within mammalian cells. At the same time, curation projects that organize knowledge from the biomedical literature into online databases are expanding. Hence, there is a wealth of information about genes, proteins and their associations, and an urgent need for data integration to achieve better knowledge extraction and data reuse. For this purpose, we developed the Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins from over 70 major online resources. We extracted, abstracted and organized data into ∼72 million functional associations between genes/proteins and their attributes. Such attributes can be physical relationships with other biomolecules, expression in cell lines and tissues, genetic associations with knockout mouse or human phenotypes, or changes in expression after drug treatment. We stored these associations in a relational database along with rich metadata for the genes/proteins, their attributes and the original resources. The freely available Harmonizome web portal provides a graphical user interface, a web service and a mobile app for querying, browsing and downloading all of the collected data. To demonstrate the utility of the Harmonizome, we computed and visualized gene–gene and attribute–attribute similarity networks and, through unsupervised clustering, identified many unexpected relationships by combining pairs of datasets, such as the association between kinase perturbations and disease signatures. We also applied supervised machine learning methods to predict novel substrates for kinases, endogenous ligands for G-protein coupled receptors and mouse phenotypes for knockout genes, and to classify unannotated transmembrane proteins by their likelihood of being ion channels.
The Harmonizome is a comprehensive resource of knowledge about genes and proteins, and as such, it enables researchers to discover novel relationships between biological entities, as well as form novel data-driven hypotheses for experimental validation. Database URL: http://amp.pharm.mssm.edu/Harmonizome.
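
The idea of deriving gene–gene and attribute–attribute similarity networks from gene–attribute associations can be illustrated with a minimal sketch. It assumes the associations are binary (a gene either has an attribute or not) and uses the Jaccard index with a fixed threshold; the abstract does not state which similarity measure was used, so this choice, the toy data and the function names (`jaccard`, `similarity_network`, `invert`) are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: gene -> set of associated attributes.
# These names are placeholders, not records from the Harmonizome.
associations = {
    "GENE_A": {"attr1", "attr2", "attr3"},
    "GENE_B": {"attr2", "attr3"},
    "GENE_C": {"attr4"},
}


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(assoc, threshold=0.5):
    """Return edges (node1, node2, score) for pairs scoring >= threshold.

    The threshold is an assumed cutoff chosen for illustration only.
    """
    edges = []
    for n1, n2 in combinations(sorted(assoc), 2):
        score = jaccard(assoc[n1], assoc[n2])
        if score >= threshold:
            edges.append((n1, n2, score))
    return edges


def invert(assoc):
    """Swap roles (attribute -> set of genes) to build attribute networks."""
    inverted = {}
    for gene, attributes in assoc.items():
        for attribute in attributes:
            inverted.setdefault(attribute, set()).add(gene)
    return inverted


gene_edges = similarity_network(associations)
attribute_edges = similarity_network(invert(associations))
```

Here `gene_edges` links genes that share a large fraction of their attributes, and `attribute_edges` applies the same procedure to the inverted mapping. For real data with tens of millions of associations, a sparse-matrix formulation would likely be needed instead of pairwise set comparisons.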
