Bayesian nonparametric modelling of sequential discoveries

We aim to model the appearance of distinct tags in a sequence of labelled objects. Common examples of this type of data include words in a corpus or distinct species in a sample. These sequential discoveries are often summarised via accumulation curves, which count the number of distinct entities observed in an increasingly large set of objects. We propose a novel Bayesian nonparametric method for species sampling modelling that directly specifies the probability of a new discovery, thereby allowing for flexible specifications. The asymptotic behaviour and finite-sample properties of this approach are studied extensively. The resulting enlarged class of sequential processes includes highly tractable special cases, and we present a subclass of models characterised by appealing theoretical and computational properties. Moreover, owing to strong connections with logistic regression models, this subclass can naturally account for covariates. Finally, we test our proposal on both synthetic and real data, with special emphasis on a large fungal biodiversity study in Finland.
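
As a rough illustration of the two objects described above, the following Python sketch computes an accumulation curve from a sequence of tags and simulates one by specifying the probability of a new discovery directly. The function names (`accumulation_curve`, `simulate_discoveries`, `dirichlet_like`) and the parameter `alpha` are hypothetical and introduced only for this example. The discovery probability alpha / (alpha + m) is the familiar Dirichlet process urn form, used here as a stand-in assumption; it is not claimed to be the specification proposed in the paper.

```python
import random


def accumulation_curve(tags):
    """Return [K_1, ..., K_n], where K_i is the number of distinct
    tags among the first i objects of the sequence."""
    seen = set()
    curve = []
    for tag in tags:
        seen.add(tag)
        curve.append(len(seen))
    return curve


def simulate_discoveries(n, new_prob, seed=0):
    """Simulate an accumulation curve of length n.

    new_prob(m) is the probability that the next object carries a new
    tag, given that m objects have been observed so far. The discovery
    indicators are drawn as independent Bernoulli variables.
    """
    rng = random.Random(seed)
    curve = []
    distinct = 0
    for m in range(n):
        if rng.random() < new_prob(m):
            distinct += 1
        curve.append(distinct)
    return curve


def dirichlet_like(alpha):
    """Illustrative assumption: discovery probability alpha / (alpha + m)."""
    return lambda m: alpha / (alpha + m)


if __name__ == "__main__":
    # Empirical curve from a toy sequence of labelled objects.
    print(accumulation_curve(["a", "b", "a", "c", "b"]))
    # Simulated curve under the assumed discovery probability.
    print(simulate_discoveries(20, dirichlet_like(alpha=2.0), seed=1))
```

Because `new_prob` is an arbitrary function of the number of objects seen so far, other specifications (for example, a logistic function of covariates) could be swapped in without changing the simulation loop.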
