Comparisons of Non-Gaussian Statistical Models in DNA Methylation Analysis

As a key regulatory mechanism of gene expression, DNA methylation is widely altered in many complex genetic diseases, including cancer. DNA methylation levels are naturally quantified as bounded-support data and are therefore not Gaussian distributed. To capture this property, we introduce several non-Gaussian statistical models for dimension reduction of DNA methylation data. Unsupervised clustering strategies based on non-Gaussian statistical models are then applied to cluster the data. We present comparisons and analyses of the different dimension reduction strategies and unsupervised clustering methods. Experimental results show that the non-Gaussian model-based methods outperform the conventional Gaussian distribution-based method, making them useful tools for DNA methylation analysis. Moreover, among the non-Gaussian methods considered, the one that captures the bounded nature of DNA methylation data achieves the best clustering performance.
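To illustrate why bounded-support data are poorly described by a Gaussian, the following minimal sketch fits a beta distribution and a Gaussian to synthetic values in (0, 1) and compares their log-likelihoods. It is not the method used in this work: the synthetic data, the shape parameters (0.5, 3.0), the method-of-moments beta fit, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def beta_logpdf(x, a, b):
    """Log-density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def gauss_logpdf(x, mu, var):
    """Log-density of a Gaussian with mean mu and variance var at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def gauss_cdf(x, mu, var):
    """Cumulative distribution function of the same Gaussian."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))


def fit_and_compare(values):
    """Fit both models to values in (0, 1) and report their log-likelihoods."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n

    # Method-of-moments beta fit (an illustrative choice, not the paper's estimator).
    common = mean * (1 - mean) / var - 1
    a, b = mean * common, (1 - mean) * common

    ll_beta = sum(beta_logpdf(v, a, b) for v in values)
    ll_gauss = sum(gauss_logpdf(v, mean, var) for v in values)

    # Probability mass the fitted Gaussian places outside the valid range [0, 1].
    outside = gauss_cdf(0.0, mean, var) + (1 - gauss_cdf(1.0, mean, var))
    return {"alpha": a, "beta": b, "ll_beta": ll_beta,
            "ll_gauss": ll_gauss, "gauss_mass_outside": outside}


# Synthetic stand-in for methylation levels (hypothetical, skewed towards 0).
rng = random.Random(0)
eps = 1e-6  # keep values strictly inside (0, 1) so the logarithms are finite
values = [min(max(rng.betavariate(0.5, 3.0), eps), 1 - eps) for _ in range(2000)]

result = fit_and_compare(values)
print(result)
```

On such skewed data the beta fit attains a higher log-likelihood than the Gaussian fit, and the fitted Gaussian assigns a non-negligible share of its probability mass to values outside [0, 1], which no methylation level can take.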
