Classifying Big DNA Methylation Data: A Gene-Oriented Approach

Thanks to Next Generation Sequencing (NGS) techniques, publicly available genomic data on cancer are growing quickly. Indeed, The Cancer Genome Atlas (TCGA), the largest public database of cancer, contains huge amounts of biomedical big data to be analyzed with advanced knowledge extraction methods. In this work, we focus on the NGS experiment of DNA methylation, whose data matrices are composed of hundreds of thousands of features (i.e., methylated sites). We propose an efficient data processing procedure that yields a gene-oriented organization of the data and enables supervised machine learning analysis with state-of-the-art methods. The procedure divides the original data matrices into several sub-matrices, each containing the sites located within the same gene. We extract DNA methylation data of three tumor types (i.e., breast, prostate, and thyroid carcinomas) from TCGA, and we successfully discriminate tumoral from non-tumoral samples using function-, tree-, and rule-based classifiers. Finally, we select the best-performing genes (matrices) by ranking them according to the accuracy of the classifiers, and we run an enrichment analysis on them. These genes can be further investigated by domain experts to confirm their relation to the cancers under study.
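
The gene-oriented procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The data layout, the site-to-gene annotation, the function names (split_by_gene, loo_accuracy, rank_genes), and the toy values are all assumptions made for illustration. A simple nearest-class-mean rule with leave-one-out evaluation stands in for the function-, tree-, and rule-based classifiers used in the work. The sketch also assumes that each site maps to exactly one gene.

```python
from collections import defaultdict


def split_by_gene(matrix, site_to_gene):
    """Split a samples x sites matrix into one sub-matrix per gene.

    matrix: dict of sample id -> dict of site id -> methylation value.
    site_to_gene: dict of site id -> gene symbol (hypothetical annotation).
    Sites without an annotation are simply left out.
    """
    gene_sites = defaultdict(list)
    for site, gene in site_to_gene.items():
        gene_sites[gene].append(site)
    return {
        gene: {sample: [row[site] for site in sites]
               for sample, row in matrix.items()}
        for gene, sites in gene_sites.items()
    }


def loo_accuracy(sub_matrix, labels):
    """Leave-one-out accuracy of a nearest-class-mean rule.

    Stand-in classifier (an assumption, not the paper's method): each
    sample is summarized by the mean value over the gene's sites and
    assigned to the class whose training mean is closest.
    """
    means = {s: sum(v) / len(v) for s, v in sub_matrix.items()}
    correct = 0
    for held_out in means:
        centroids = {}
        for label in sorted(set(labels.values())):
            vals = [means[s] for s in means
                    if s != held_out and labels[s] == label]
            if vals:
                centroids[label] = sum(vals) / len(vals)
        predicted = min(centroids,
                        key=lambda c: abs(centroids[c] - means[held_out]))
        correct += predicted == labels[held_out]
    return correct / len(means)


def rank_genes(matrix, site_to_gene, labels):
    """Rank genes (sub-matrices) by classifier accuracy, best first."""
    subs = split_by_gene(matrix, site_to_gene)
    scores = {gene: loo_accuracy(sub, labels) for gene, sub in subs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: invented values, not taken from TCGA.
matrix = {
    "t1": {"cg01": 0.90, "cg02": 0.85, "cg03": 0.40},
    "t2": {"cg01": 0.80, "cg02": 0.95, "cg03": 0.60},
    "n1": {"cg01": 0.10, "cg02": 0.20, "cg03": 0.45},
    "n2": {"cg01": 0.15, "cg02": 0.10, "cg03": 0.55},
}
site_to_gene = {"cg01": "GENE_A", "cg02": "GENE_A", "cg03": "GENE_B"}
labels = {"t1": "tumoral", "t2": "tumoral", "n1": "normal", "n2": "normal"}

ranking = rank_genes(matrix, site_to_gene, labels)
for gene, accuracy in ranking:
    print(gene, accuracy)
```

In this toy example GENE_A separates the two classes and ranks first, while GENE_B does not. In practice, the stand-in rule would be replaced by established classifiers and a proper cross-validation scheme.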
