Knowledge-guided analysis of "omics" data using the KnowEnG cloud platform

We present Knowledge Engine for Genomics (KnowEnG), a free-to-use computational system for analysis of genomics data sets, designed to accelerate biomedical discovery. It includes tools for popular bioinformatics tasks such as gene prioritization, sample clustering, gene set analysis, and expression signature analysis. The system specializes in “knowledge-guided” data mining and machine learning algorithms, in which user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge bases and encoded in a massive “Knowledge Network.” KnowEnG adheres to “FAIR” principles (findable, accessible, interoperable, and reusable): its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution, and are interoperable with other computing platforms. The analysis tools are available through multiple access modes, including a web portal with specialized visualization modules. Through several case studies that use its tools to recreate and expand upon published analyses of cancer data sets, we demonstrate the KnowEnG system’s potential to democratize advanced tools for the modern genomics era.
