CLEP: a hybrid data- and knowledge-driven framework for generating patient representations

Abstract

Summary: As machine learning and artificial intelligence find a growing number of applications in the biomedical domain, their utility ultimately depends on the data used to train them. Because biomedical data are complex and high-dimensional, there is a need for approaches that combine prior knowledge of known biological interactions with patient data. Here, we present CLinical Embedding of Patients (CLEP), a novel approach that generates new patient representations by leveraging both prior knowledge and patient-level data. First, given a patient-level dataset and a knowledge graph whose relations connect features that can be mapped to the dataset, CLEP incorporates patients into the knowledge graph as new nodes connected to their most characteristic features. Next, CLEP employs knowledge graph embedding models to generate new patient representations that can be used for a variety of downstream tasks, ranging from clustering to classification. We demonstrate that, for a variety of machine learning models, the patient representations generated by CLEP significantly improve performance in classifying patients versus healthy controls compared with the original transcriptomics data. Furthermore, we show how incorporating patients into a knowledge graph can support the interpretation and identification of biological features characteristic of a specific disease or patient subgroup. Finally, we have released CLEP as an open source Python package together with examples and documentation.

Availability and implementation: CLEP is available to the bioinformatics community as an open source Python package at https://github.com/hybrid-kg/clep under the Apache 2.0 License.

Supplementary information: Supplementary data are available at Bioinformatics online.
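The first step described in the summary, adding patients to a knowledge graph as nodes linked to their most characteristic features, can be illustrated with a minimal sketch. This is not CLEP's implementation: the z-score comparison against healthy controls, the threshold of 2.0, the relation names `up` and `down`, the helper `patient_triples`, and the toy gene names and values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def patient_triples(patients, controls, threshold=2.0):
    """Build (patient, relation, feature) triples for strongly deviating features.

    Hypothetical helper: a feature counts as "characteristic" for a patient
    when its z-score relative to the controls reaches the threshold.
    """
    features = sorted(next(iter(controls.values())))
    # Reference mean and standard deviation per feature, estimated from controls.
    reference = {
        f: (
            mean(c[f] for c in controls.values()),
            stdev(c[f] for c in controls.values()),
        )
        for f in features
    }
    triples = []
    for patient_id, values in patients.items():
        for f in features:
            mu, sd = reference[f]
            if sd == 0:
                continue  # a constant feature carries no signal
            z = (values[f] - mu) / sd
            if z >= threshold:
                triples.append((patient_id, "up", f))
            elif z <= -threshold:
                triples.append((patient_id, "down", f))
    return triples


# Toy data (made up): expression values for two genes.
controls = {
    "c1": {"GENE_A": 1.0, "GENE_B": 5.0},
    "c2": {"GENE_A": 1.2, "GENE_B": 5.5},
    "c3": {"GENE_A": 0.8, "GENE_B": 4.5},
}
patients = {
    "p1": {"GENE_A": 2.0, "GENE_B": 5.1},
    "p2": {"GENE_A": 1.0, "GENE_B": 3.0},
}

# Prior knowledge as triples between features (also made up).
knowledge_graph = [("GENE_A", "interacts_with", "GENE_B")]

# The combined triple list is what a knowledge graph embedding model would train on.
new_edges = patient_triples(patients, controls)
combined_graph = knowledge_graph + new_edges
print(combined_graph)
```

In the second step, a triple list of this kind would be passed to a knowledge graph embedding model of choice, and the learned vectors of the patient nodes would serve as the new patient representations for clustering or classification.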
