Using epigenomics data to predict gene expression in lung cancer

Background: Epigenetic alterations are known to correlate with changes in gene expression in various diseases, including cancers. However, quantitative models that accurately predict the up- or down-regulation of gene expression are currently lacking.

Methods: A new machine learning-based method of gene expression prediction is developed in the context of lung cancer. The method uses Illumina Infinium HumanMethylation450K BeadChip CpG methylation array data from paired lung cancer and adjacent normal tissues in The Cancer Genome Atlas (TCGA), together with histone modification marker ChIP-Seq data from the ENCODE project, to predict differential expression in RNA-Seq data from TCGA lung cancers. It considers a comprehensive list of 1424 features spanning four categories: CpG methylation, histone H3 methylation modification, nucleotide composition, and conservation. Various feature selection and classification methods are compared, and the best model is selected by 10-fold cross-validation on the training data set.

Results: The best model comprises 67 features and is obtained by ReliefF-based feature selection combined with random forest classification, with AUC = 0.864 from 10-fold cross-validation on the training set and AUC = 0.836 on the testing set. The selected features cover all four data types, with histone H3 methylation modification (32 features) and CpG methylation (15 features) being the most abundant. In drop-off tests that remove the features of one data type at a time, removal of the CpG methylation features leads to the largest reduction in model performance. In the best model, 19 selected features are from the promoter regions (TSS200 and TSS1500), the highest count among all locations relative to transcripts. Sequential drop-off of CpG methylation features by region on the protein-coding transcripts shows that promoter regions contribute most to the accurate prediction of gene expression.

Conclusions: By considering a comprehensive list of epigenomic and genomic features, we have constructed an accurate model to predict transcriptomic differential expression, exemplified in lung cancer.
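
The two ideas behind the reported numbers, Relief-style feature weighting and AUC as the evaluation metric, can be illustrated with a minimal sketch. It is not the authors' implementation. It uses the basic two-class Relief algorithm (one nearest hit and one nearest miss per instance), a simpler relative of ReliefF, and it visits every instance once instead of sampling. The function names `relief_weights` and `auc`, the four-sample toy data, and the assumption that features are already scaled to [0, 1] are all hypothetical choices made for illustration.

```python
def relief_weights(X, y):
    """Basic two-class Relief weights (simplified relative of ReliefF).

    Assumes every feature is already scaled to [0, 1] and that each class
    has at least two samples. Visits every instance exactly once.
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def dist(a, b):
        # Manhattan distance over all features
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = min(same, key=lambda j: dist(X[i], X[j]))    # nearest same-class sample
        miss = min(other, key=lambda j: dist(X[i], X[j]))  # nearest other-class sample
        for f in range(d):
            # Reward features that differ on the miss, penalize those that differ on the hit
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[0.1, 0.2], [0.2, 0.8], [0.9, 0.3], [0.8, 0.7]]
y = [0, 0, 1, 1]  # e.g. 1 = up-regulated, 0 = down-regulated

weights = relief_weights(X, y)
auc_feature0 = auc([row[0] for row in X], y)
auc_feature1 = auc([row[1] for row in X], y)
print(weights, auc_feature0, auc_feature1)
```

On this toy data the informative feature receives a positive weight and the uninformative one a negative weight. Using each feature alone as a score gives an AUC of 1.0 and 0.5 respectively. In a pipeline like the one described, the top-weighted features would then be passed to a classifier and the AUC computed on held-out folds.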
