A Novel Computational Method for Detecting DNA Methylation Sites with DNA Sequence Information and Physicochemical Properties

DNA methylation is an important biochemical process that is closely associated with many types of cancer. Research on DNA methylation helps us understand regulation mechanisms and epigenetic reprogramming, so recognizing methylation sites in DNA sequences is an important task. Since high-throughput sequencing technology became widely used in research and industry, many computational methods, especially machine learning methods, have been developed over the past several decades. To accurately identify whether a nucleotide residue is methylated in a specific DNA sequence context, we propose a novel method that overcomes the shortcomings of previous methods for predicting methylation sites. We extract features using k-gram, multivariate mutual information, discrete wavelet transform, and pseudo amino acid composition, and train a sparse Bayesian learning model to predict DNA methylation. Five criteria are used to evaluate the prediction results of our method: area under the receiver operating characteristic curve (AUC), Matthews correlation coefficient (MCC), accuracy (ACC), sensitivity (SN), and specificity. On the benchmark dataset, our method reached 0.8632 for AUC, 0.8017 for ACC, 0.5558 for MCC, and 0.7268 for SN. Additionally, the best AUC values on two scBS-seq profiled mouse embryonic stem cell datasets were 0.8896 and 0.9511, respectively. Compared with other outstanding methods, our method achieved higher prediction accuracy, improving AUC by at least 0.0399. For the convenience of other researchers, our code has been uploaded to a file hosting service and can be downloaded from: https://figshare.com/s/0697b692d802861282d3.
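
As an illustration of the five evaluation criteria named above, the following minimal Python sketch computes ACC, SN, specificity, and MCC from binary predictions, and AUC from prediction scores using the pairwise ranking definition. It is not taken from the authors' released code. The function names, the label encoding (1 = methylated, 0 = unmethylated), and the key "SP" for specificity are assumptions made for this example.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN; assumes labels 1 = methylated, 0 = unmethylated."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def threshold_metrics(y_true, y_pred):
    """Return ACC, SN, SP (specificity), and MCC for binary predictions."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal count is zero; report 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and scores, made up purely to exercise the functions.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
    preds = [1 if s >= 0.5 else 0 for s in scores]
    print(threshold_metrics(labels, preds))
    print(auc(labels, scores))
```

The pairwise AUC is quadratic in the number of samples, which is fine for small checks. On large datasets a rank-based or library implementation would be the more practical choice.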
