A pitfall for machine learning methods aiming to predict across cell types

Machine learning models used to predict phenomena such as gene expression, enhancer activity, transcription factor binding, or chromatin conformation are most useful when they can generalize to make accurate predictions across cell types. In this situation, a natural strategy is to train the model on experimental data from some cell types and evaluate performance on one or more held-out cell types. In this work, we show that when the training set contains examples derived from the same genomic loci across multiple cell types, the resulting model can be susceptible to a particular form of bias related to memorizing the average activity associated with each genomic locus. Consequently, the trained model may appear to perform well when evaluated on the genomic loci that it was trained on but tends to perform poorly on loci that it was not trained on. We demonstrate this phenomenon by using epigenomic measurements and nucleotide sequence to predict gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data and computing resources become available, future projects will increasingly risk suffering from this issue.
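One way to check for this pitfall can be sketched on synthetic data. The sketch below is an illustrative assumption, not the authors' code or data: the simulated activity matrix and the function names (`make_synthetic_activity`, `average_activity_baseline`, `split_loci`) are all hypothetical. It makes two points. First, a baseline that only averages each locus's activity over the training cell types can already correlate strongly with a held-out cell type. Second, holding out loci as well as cell types keeps the test set free of loci the model has seen.

```python
import numpy as np


def make_synthetic_activity(n_loci=2000, n_cell_types=10, seed=0):
    """Simulate a loci x cell-types activity matrix (hypothetical data).

    Each locus has a shared mean activity plus a smaller
    cell-type-specific deviation.
    """
    rng = np.random.default_rng(seed)
    locus_mean = rng.normal(0.0, 1.0, size=(n_loci, 1))
    cell_specific = rng.normal(0.0, 0.3, size=(n_loci, n_cell_types))
    return locus_mean + cell_specific


def average_activity_baseline(activity, train_cells):
    """Predict each locus as its mean activity over the training cell types."""
    return activity[:, train_cells].mean(axis=1)


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def split_loci(n_loci, test_fraction=0.2, seed=0):
    """Randomly partition locus indices into disjoint train and test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_loci)
    n_test = int(round(test_fraction * n_loci))
    return perm[n_test:], perm[:n_test]


activity = make_synthetic_activity()
train_cells = np.arange(9)  # cell types available for training
test_cell = 9               # held-out cell type

# The baseline uses no features at all, only the per-locus average.
baseline = average_activity_baseline(activity, train_cells)
r_baseline = pearson(baseline, activity[:, test_cell])
print(f"average-activity baseline, held-out cell type: r = {r_baseline:.3f}")

# Holding out loci as well means evaluation uses loci absent from training.
train_loci, test_loci = split_loci(activity.shape[0])
print(f"{len(train_loci)} training loci, {len(test_loci)} held-out loci")
```

Under these assumptions, a trained model whose held-out-cell-type score does not clearly exceed `r_baseline` may have learned little beyond per-locus averages. Scoring it only on `test_loci` gives a view of performance that memorization cannot inflate. In practice the split would more likely follow chromosomes or genomic blocks than a random permutation, since neighboring loci can be correlated.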
