Using topic modeling via non-negative matrix factorization to identify relationships between genetic variants and disease phenotypes: A case study of Lipoprotein(a) (LPA)

Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most such studies treat diseases as independent variables and carry a heavy multiple-testing burden because of the large number of genetic variants and disease phenotypes. In this study, we used topic modeling via non-negative matrix factorization (NMF) to identify associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can be used to learn patterns from electronic health record (EHR) data. We chose the single nucleotide polymorphism (SNP) rs10455872 in LPA as the predictor because it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data from 12,759 individuals with EHRs and linked DNA samples at Vanderbilt University Medical Center, we trained an NMF topic model on 1,853 distinct phenotypes and identified six topics. We then tested the association of each topic with rs10455872. Topics enriched for CVD and hyperlipidemia were positively correlated with rs10455872 (P < 0.001), replicating a previous finding. We also identified a negative correlation between the LPA variant and a topic enriched for lung cancer (P < 0.001), which had not previously been identified via phenome-wide scanning. We were able to replicate the top finding in a separate dataset. Our results demonstrate the applicability of topic modeling in exploring the relationship between genetic variants and clinical diseases.
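
The workflow described above (factor a patient-by-phenotype matrix into topics, then test each topic against a genotype) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the matrix sizes and the choice of two topics are arbitrary, the NMF solver is a plain multiplicative-update routine rather than whatever implementation the study used, and a simple Pearson correlation stands in for the association test, whose exact form the abstract does not specify.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=500, seed=0, eps=1e-9):
    """Factor non-negative X (patients x phenotypes) into W (patients x k)
    and H (k x phenotypes) using Lee-Seung multiplicative updates for the
    Frobenius-norm objective. Hypothetical helper, not the study's code."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic cohort (assumption): 200 patients, 12 binary phenotype codes.
rng = np.random.default_rng(42)
n_patients, n_phenotypes, n_topics = 200, 12, 2

# Minor-allele count (0, 1, or 2) for a single hypothetical SNP.
genotype = rng.binomial(2, 0.3, size=n_patients)

# Phenotypes 0-5 become more likely with each allele copy;
# phenotypes 6-11 occur independently of genotype.
p_linked = 0.1 + 0.3 * genotype
X = np.zeros((n_patients, n_phenotypes))
X[:, :6] = rng.random((n_patients, 6)) < p_linked[:, None]
X[:, 6:] = rng.random((n_patients, 6)) < 0.2

W, H = nmf_multiplicative(X, n_topics)

# One test per topic instead of one test per phenotype:
# correlate each topic's patient loadings with the allele count.
corrs = [np.corrcoef(W[:, t], genotype)[0, 1] for t in range(n_topics)]
for t, r in enumerate(corrs):
    top_codes = np.argsort(H[t])[::-1][:3]
    print(f"topic {t}: r = {r:+.3f}, top phenotype indices = {top_codes.tolist()}")
```

The point of the sketch is the reduction in the number of tests: the genotype is compared against a handful of topic loadings (columns of `W`) rather than against every phenotype column, while the rows of `H` indicate which phenotypes each topic is enriched for.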
