Clustering Count-based RNA Methylation Data Using a Nonparametric Generative Model

Background: The RNA methylome has emerged as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuitry of the epitranscriptome remains uncharted, clustering of methylation status among RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data have distinctive features, such as low read coverage, that call for novel clustering approaches. Objective: Beyond handling low read coverage, clustering analysis of count-based RNA methylation sequencing data should preserve the integer nature of the measurements. Method: We propose a nonparametric generative model, together with a Gibbs sampling solution, for clustering analysis. The approach uses a beta-binomial mixture model to capture clustering in methylation level directly from the original count-based measurements rather than from an estimated continuous methylation level. It also adopts a nonparametric Dirichlet process to determine an optimal number of clusters automatically, avoiding the model selection problem common in clustering analysis. Results: On simulated data, the method showed improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it revealed two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, two known regulatory components of the RNA m6A methyltransferase complex. Conclusion: The proposed DPBBM method properly handles count-based measurements of RNA methylation data from sites with very low read coverage, and it learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documentation of the DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
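
Illustration: The two ingredients named in the Method section, a beta-binomial likelihood on raw read counts and a Dirichlet process prior over cluster membership, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the DPBBM package's implementation: the function names, the single-sample setting, and the fixed `new_cluster` parameter pair (standing in for a draw from the base distribution) are all hypothetical and chosen for illustration only.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, alpha, beta):
    """Log probability of k methylated reads out of n total reads
    under a beta-binomial with shape parameters alpha and beta."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def assignment_probs(k, n, clusters, sizes, concentration,
                     new_cluster=(1.0, 1.0)):
    """Probabilities of assigning one site to each existing cluster or to
    a new one, using Chinese-restaurant-process weights: an existing
    cluster is weighted by its size, a new cluster by the concentration.

    Assumption: `new_cluster` is a fixed (alpha, beta) pair used in place
    of sampling or integrating over the base distribution.
    """
    log_w = [math.log(size) + betabinom_logpmf(k, n, a, b)
             for (a, b), size in zip(clusters, sizes)]
    log_w.append(math.log(concentration)
                 + betabinom_logpmf(k, n, *new_cluster))
    top = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in log_w]
    total = sum(weights)
    return [w / total for w in weights]


# A site with 9 methylated reads out of 10, two existing clusters:
# one highly methylated (8, 2) and one weakly methylated (2, 8).
probs = assignment_probs(9, 10, clusters=[(8.0, 2.0), (2.0, 8.0)],
                         sizes=[5, 5], concentration=1.0)
```

Because the likelihood is evaluated on the integer counts `k` and `n` themselves, a site with 1 of 2 reads methylated and a site with 50 of 100 are treated differently even though both have an estimated methylation level of 0.5. The last entry of `probs` is the probability of opening a new cluster, which is how a Dirichlet process lets the number of clusters grow with the data.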
