Classification of large DNA methylation datasets for identifying cancer drivers

DNA methylation is a well-studied epigenetic modification that is crucial for regulating the functioning of the genome. Its alterations play an important role in tumorigenesis and tumor suppression, so studying DNA methylation data may help biomarker discovery in cancer. As public DNA methylation data become abundant, and given the high number of methylated sites (features) present in the genome, a method for efficiently processing such large datasets is needed. Relying on big data technologies, we propose BIGBIOCL, an algorithm that can apply supervised classification methods to datasets with hundreds of thousands of features. It is designed to extract alternative and equivalent classification models through iterative deletion of selected features. We run experiments on DNA methylation datasets extracted from The Cancer Genome Atlas, focusing on three tumor types: breast, kidney, and thyroid carcinomas. We perform classifications that extract several methylated sites and their associated genes while achieving accurate performance. Results suggest that BIGBIOCL can perform hundreds of classification iterations on hundreds of thousands of features in a few hours. Moreover, we compare the performance of our method with other state-of-the-art classifiers and with a widespread DNA methylation analysis method based on network analysis. Finally, we are able to efficiently compute multiple alternative classification models and extract, from large DNA methylation datasets, a set of candidate genes to be further investigated to determine their active role in cancer. BIGBIOCL, the results of the experiments, and a guide for carrying out new experiments are freely available on GitHub.
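
The idea of extracting alternative models through iterative deletion of selected features can be illustrated with a minimal sketch. It is not the BIGBIOCL implementation: a one-feature threshold rule stands in for the supervised classifier, and the function names, the accuracy cutoff `min_accuracy`, the iteration cap `max_iterations`, and the toy beta values are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Model = Tuple[int, float, float]  # (feature index, threshold, training accuracy)


def best_stump(X: Sequence[Sequence[float]], y: Sequence[int],
               features: Sequence[int]) -> Model:
    """Return the best one-feature threshold rule among the given features."""
    best: Model = (-1, 0.0, 0.0)
    for f in features:
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            correct = sum((row[f] > t) == bool(label) for row, label in zip(X, y))
            # The rule may point in either direction, so keep the better one.
            acc = max(correct, len(y) - correct) / len(y)
            if acc > best[2]:
                best = (f, t, acc)
    return best


def iterative_feature_deletion(X: Sequence[Sequence[float]], y: Sequence[int],
                               min_accuracy: float = 0.9,
                               max_iterations: int = 10) -> List[Model]:
    """Fit a model, delete the feature it selected, and repeat.

    Stops when the best remaining model falls below min_accuracy,
    when no features are left, or after max_iterations rounds.
    """
    remaining = set(range(len(X[0])))
    models: List[Model] = []
    for _ in range(max_iterations):
        if not remaining:
            break
        f, t, acc = best_stump(X, y, sorted(remaining))
        if f < 0 or acc < min_accuracy:
            break
        models.append((f, t, acc))
        remaining.discard(f)  # force the next model to use other features
    return models


# Hypothetical toy data: rows are samples, columns are methylation beta values.
TOY_X = [
    [0.90, 0.5, 0.8, 0.2],
    [0.80, 0.1, 0.7, 0.6],
    [0.85, 0.9, 0.9, 0.3],
    [0.10, 0.4, 0.2, 0.4],
    [0.20, 0.8, 0.1, 0.7],
    [0.15, 0.2, 0.3, 0.5],
]
TOY_Y = [1, 1, 1, 0, 0, 0]  # 1 = tumor, 0 = normal

for feature, threshold, accuracy in iterative_feature_deletion(TOY_X, TOY_Y):
    print(f"feature {feature}: threshold {threshold:.3f}, accuracy {accuracy:.2f}")
```

Each round yields a model that relies on features the earlier models did not use, so the union of selected features forms the candidate list for follow-up. In a setting like the one described above, the stand-in rule would be replaced by a classifier that can select many features per iteration.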
