A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform

Background: The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, relatively little is known as to how best to perform feature selection or classification in the context of Illumina Infinium methylation data. Given the rising importance of epigenomics in cancer and other complex genetic diseases, and in view of the upcoming epigenome-wide association studies, it is critical to identify the statistical methods that offer improved inference in this novel context.

Results: Using a total of 7 large Illumina Infinium 27k Methylation data sets, encompassing over 1,000 samples from a wide range of tissues, we here provide an evaluation of popular feature selection, dimensionality reduction and classification methods on DNA methylation data. Specifically, we evaluate the effects of variance filtering, supervised principal components (SPCA) and the choice of DNA methylation quantification measure on downstream statistical inference. We show that for relatively large sample sizes feature selection using test statistics is similar for M and β-values, but that in the limit of small sample sizes, M-values allow more reliable identification of true positives. We also show that the effect of variance filtering on feature selection is study-specific and dependent on the phenotype of interest and tissue type profiled. Specifically, we find that variance filtering improves the detection of true positives in studies with large effect sizes, but that it may lead to worse performance in studies with smaller yet significant effect sizes. In contrast, supervised principal components improves the statistical power, especially in studies with small effect sizes. We also demonstrate that classification using the Elastic Net and Support Vector Machine (SVM) clearly outperforms competing methods like LASSO and SPCA. Finally, in unsupervised modelling of cancer diagnosis, we find that non-negative matrix factorisation (NMF) clearly outperforms principal components analysis.

Conclusions: Our results highlight the importance of tailoring the feature selection and classification methodology to the sample size and biological context of the DNA methylation study. The Elastic Net emerges as a powerful classification algorithm for large-scale DNA methylation studies, while NMF does well in the unsupervised context. The insights presented here will be useful to any study embarking on large-scale DNA methylation profiling using Illumina Infinium beadarrays.
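
As a rough illustration of two ingredients named in the abstract, the sketch below converts β-values to M-values using the commonly used logit relation M = log2(β / (1 − β)) and ranks CpGs by variance. It is a minimal sketch, not the procedure used in the study: the function names, the clipping constant `eps`, the CpG labels and the toy values are all assumptions made for illustration.

```python
import math
from statistics import pvariance


def beta_to_m(beta, eps=1e-6):
    """Map a beta-value in [0, 1] to an M-value via the base-2 logit.

    `eps` is an assumed clipping constant that keeps the logarithm finite
    when beta is exactly 0 or 1.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: map an M-value back to a beta-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def variance_filter(profiles, keep):
    """Return the names of the `keep` CpGs with the largest variance.

    `profiles` maps a CpG name to its list of values across samples.
    """
    ranked = sorted(profiles, key=lambda cpg: pvariance(profiles[cpg]), reverse=True)
    return ranked[:keep]


# Toy beta-value profiles (hypothetical CpG names, four samples each).
toy_profiles = {
    "cg_a": [0.1, 0.9, 0.1, 0.9],
    "cg_b": [0.5, 0.5, 0.5, 0.5],
    "cg_c": [0.4, 0.6, 0.4, 0.6],
}

# Filter on the beta scale, then move the surviving CpGs to the M scale.
selected = variance_filter(toy_profiles, keep=2)
m_profiles = {cpg: [beta_to_m(b) for b in toy_profiles[cpg]] for cpg in selected}
```

Because the logit stretches values near 0 and 1, a variance ranking computed on M-values can differ from one computed on β-values, so the scale on which filtering is applied is itself a choice to make deliberately.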
