CLEP: a hybrid data- and knowledge-driven framework for generating patient representations

As machine learning and artificial intelligence become more useful for interpreting biomedical data, their utility remains dependent on the data used to train them. Because biomedical data are complex and high-dimensional, there is a need for approaches that combine prior knowledge of biological interactions with patient data. Here, we present CLEP, a novel approach that generates new patient representations by leveraging both prior knowledge and patient-level data. First, given a patient-level dataset and a knowledge graph whose relations connect features that can be mapped to the dataset, CLEP incorporates patients into the knowledge graph as new nodes connected to their most characteristic features. Next, CLEP employs knowledge graph embedding models to generate new patient representations that can be used for a variety of downstream tasks, ranging from clustering to classification. We demonstrate that, compared with the original transcriptomics data, the patient representations generated by CLEP significantly improve the performance of a variety of machine learning models in distinguishing patients from healthy controls. Furthermore, we show how incorporating patients into a knowledge graph can support the interpretation and identification of biological features characteristic of a specific disease or patient subgroup. Finally, we have released CLEP as an open-source Python package, together with examples and documentation.
