Sampling designs via a multivariate hypergeometric-Dirichlet process model for a multi-species assemblage with unknown heterogeneity

In a sample of mRNA species counts, sequences with no duplicates or with only a few copies are likely to carry information related to mutations or diseases and can be of great interest. In some situations, however, sequence abundance is unknown, and sequencing the whole sample to find the rare sequences is not practical. To collect mRNA sequences of interest or, more generally, species of interest, we propose a two-phase Bayesian sampling method. The first phase infers sequence (species) abundance levels through a cluster analysis applied to a pilot data set. The clustering method is built on a multivariate hypergeometric model with a Dirichlet process prior for species relative frequencies. The second phase uses Monte Carlo simulations to infer the sample size necessary to collect a given number of species of particular interest. Efficient posterior computing schemes are proposed. The approach is demonstrated and evaluated via simulations, and an mRNA segment data set is used to illustrate and motivate the proposed sampling method.
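The second-phase idea can be illustrated with a small Monte Carlo sketch. It is not the authors' algorithm: the abstract does not give the simulation details, so the code below assumes the species counts in the full sample are already known (in practice they would have to come from the first-phase inference), treats a species as "rare" when its count is at most a hypothetical threshold `rare_max`, and reports the sample size that reaches `target` distinct rare species in a fraction `prob` of simulated draws. The function names and parameters are illustrative only. Drawing without replacement from a fixed set of counts is what makes the sampled counts multivariate hypergeometric.

```python
import math
import numpy as np

def draws_needed(counts, rare_max, target, rng):
    """Draws without replacement until `target` distinct rare species are seen.

    Returns None if the population holds fewer than `target` rare species.
    """
    counts = np.asarray(counts)
    # One label per individual; the first n entries of a random permutation
    # form a simple random sample of size n drawn without replacement.
    labels = np.repeat(np.arange(counts.size), counts)
    order = rng.permutation(labels)
    is_rare = counts <= rare_max  # hypothetical definition of "rare"
    seen = set()
    for n, species in enumerate(order, start=1):
        if is_rare[species]:
            seen.add(int(species))
            if len(seen) >= target:
                return n
    return None

def sample_size(counts, rare_max, target, prob=0.9, reps=500, seed=0):
    """Empirical `prob`-quantile of the draws needed, over `reps` simulations."""
    rng = np.random.default_rng(seed)
    needed = [draws_needed(counts, rare_max, target, rng) for _ in range(reps)]
    if any(n is None for n in needed):
        raise ValueError("target exceeds the number of rare species")
    needed.sort()
    idx = min(max(math.ceil(prob * reps) - 1, 0), reps - 1)
    return needed[idx]

if __name__ == "__main__":
    # Two abundant species and three rare ones (counts are made up).
    print(sample_size([50, 30, 1, 1, 2], rare_max=2, target=2))
```

In a fully Bayesian version, the fixed `counts` would be replaced by repeated draws from the posterior of the abundances, so that the reported sample size also reflects uncertainty from the pilot data.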
