A Bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints

We consider the problem of modeling a matrix of count data, in which multiple features are observed as counts over a number of samples. Because of the data-generating mechanism, such data are often overdispersed and contain a large number of zeros. To account for this skewness and heterogeneity, some form of normalization and regularization is needed before inference can be made on the occurrences of features across samples. We propose a zero-inflated Poisson mixture modeling framework that incorporates model-based normalization through prior distributions with mean constraints, together with a feature selection mechanism. The framework identifies a parsimonious set of discriminatory features and simultaneously clusters the samples into homogeneous groups. Using a simulation study and an application to a bag-of-words benchmark data set, in which the features are the frequencies of occurrence of each word, we show that our approach improves clustering accuracy over more standard approaches to the analysis of count data.
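To illustrate why zero inflation produces both excess zeros and overdispersion, the following is a minimal sketch of the generic zero-inflated Poisson distribution. It is not the model of the paper: the function names `zip_pmf` and `zip_mean_var` and the parameters `lam` (Poisson rate) and `pi` (probability of a structural zero) are assumptions made for this example, and the mixture over clusters, the mean-constrained priors, and the feature selection step are not shown.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) for a generic zero-inflated Poisson (illustrative names).

    With probability pi the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with rate lam.
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    structural_zero = pi if y == 0 else 0.0
    return structural_zero + (1.0 - pi) * poisson


def zip_mean_var(lam, pi):
    """Mean and variance of the zero-inflated Poisson above."""
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var


# Hypothetical parameter values, chosen only for illustration.
mean, var = zip_mean_var(lam=2.0, pi=0.3)
print(zip_pmf(0, 2.0, 0.3) > math.exp(-2.0))  # more zeros than plain Poisson
print(var > mean)                             # overdispersed whenever pi > 0
```

Whenever `pi` is positive, the variance exceeds the mean, which is the overdispersion that a plain Poisson model cannot capture.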
