Weighted elastic net for unsupervised domain adaptation with application to age prediction from DNA methylation data

Abstract

Motivation: Predictive models are a powerful tool for solving complex problems in computational biology. They are typically designed to predict or classify data coming from the same unknown distribution as the training data. In many real-world settings, however, uncontrolled biological or technical factors can lead to a distribution mismatch between datasets acquired at different times, causing model performance to deteriorate on new data. A common additional obstacle in computational biology is scarce data with many more features than samples. To address these problems, we propose a method for unsupervised domain adaptation that is based on a weighted elastic net. The key idea of our approach is to compare dependencies between inputs in training and test data and to increase the cost of differently behaving features in the elastic net regularization term. In doing so, we encourage the model to assign a higher importance to features that are robust and behave similarly across domains.

Results: We evaluate our method both on simulated data with varying degrees of distribution mismatch and on real data, considering the problem of age prediction based on DNA methylation data across multiple tissues. Compared with a non-adaptive standard model, our approach substantially reduces errors on samples with a mismatched distribution. On real data, we achieve far lower errors on cerebellum samples, a tissue which is not part of the training data and poorly predicted by standard models. Our results demonstrate that unsupervised domain adaptation is possible for applications in computational biology, even with many more features than samples.

Availability and implementation: Source code is available at https://github.com/PfeiferLabTue/wenda.

Supplementary information: Supplementary data are available at Bioinformatics online.
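To make the key idea concrete, the following is a minimal Python sketch of an elastic net with feature-specific penalty weights. It is not the authors' implementation (that is in the repository linked above). The function names `mismatch_weights` and `weighted_elastic_net`, the parameters `ridge`, `k`, `lam` and `alpha`, and the toy data are all assumptions made for illustration. In particular, the mismatch score here uses a simple ridge regression of each feature on the remaining features as a stand-in for whatever dependency model the method actually uses, and the model has no intercept.

```python
import numpy as np

def mismatch_weights(X_source, X_target, ridge=1.0, k=1.0):
    """Hypothetical mismatch score: predict each feature from the others
    (ridge regression fitted on source data only) and compare the
    prediction error on target data with the error on source data."""
    p = X_source.shape[1]
    weights = np.ones(p)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        A, b = X_source[:, others], X_source[:, j]
        coef = np.linalg.solve(A.T @ A + ridge * np.eye(p - 1), A.T @ b)
        err_source = np.mean((b - A @ coef) ** 2)
        err_target = np.mean((X_target[:, j] - X_target[:, others] @ coef) ** 2)
        # Weight 1 for features that behave alike; larger when the target error grows.
        weights[j] = 1.0 + k * max(err_target / (err_source + 1e-12) - 1.0, 0.0)
    return weights

def soft_threshold(value, threshold):
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def weighted_elastic_net(X, y, weights, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam * sum_j w_j * (alpha*|b_j| + (1-alpha)/2 * b_j^2)."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.astype(float).copy()      # equals y - X @ beta for beta = 0
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual, with its own contribution added back.
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha * weights[j]) / (
                col_sq[j] + lam * (1.0 - alpha) * weights[j])
            residual -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta

# Toy data (assumption): features 0 and 1 are positively coupled in the
# source domain and negatively coupled in the target domain; 2 and 3 are stable.
rng = np.random.default_rng(0)
n = 200

def make_domain(sign):
    x1 = rng.normal(size=n)
    x0 = sign * x1 + 0.1 * rng.normal(size=n)
    return np.column_stack([x0, x1, rng.normal(size=n), rng.normal(size=n)])

X_source, X_target = make_domain(+1.0), make_domain(-1.0)
y_source = X_source[:, 0] + X_source[:, 2] + 0.1 * rng.normal(size=n)

weights = mismatch_weights(X_source, X_target)
beta_uniform = weighted_elastic_net(X_source, y_source, np.ones(4))
beta_weighted = weighted_elastic_net(X_source, y_source, weights)
print("weights:", np.round(weights, 2))
print("uniform:", np.round(beta_uniform, 3))
print("weighted:", np.round(beta_weighted, 3))
```

In this toy setting the features whose dependencies change between domains receive a larger penalty weight, so the weighted fit relies on the stable feature instead. With all weights equal to 1, the objective reduces to the standard elastic net.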
