Bayesian mixture models and their Big Data implementations with application to invasive species presence-only data

Owing to their conceptual simplicity and flexibility, non-parametric mixture models are widely used to identify latent clusters in data. For Big Data such as Landsat imagery, however, fitting these models is computationally prohibitive. To overcome this issue, we fit Bayesian non-parametric models to pre-smoothed data, which reduces the computational time from days to minutes while discarding little of the useful information. Tree-based clustering then recursively partitions the clusters into smaller clusters in order to identify clusters of high, medium and low interest. The method is applied to Landsat images of the Brisbane region, which motivated its development. The images were acquired as part of the red imported fire ant eradication program, which was launched in September 2001 and is funded by all Australian states and territories together with the federal government. To satisfy budgetary constraints, the risk of fire ant incursion is estimated for each cluster so that the eradication program can focus on high-risk clusters. The likelihood of containment is derived by combining fieldwork survey data with the results of the proposed method.
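The two ideas described above, pre-smoothing followed by tree-based clustering, can be illustrated with a minimal sketch. It is not the authors' implementation: the pre-smoothing here is plain binning into (mean, count) summaries, and a weighted 2-means split stands in for the Bayesian non-parametric mixture fit. All function names, the bin width and the toy pixel values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean


def pre_smooth(values, bin_width):
    """Compress raw values into (bin mean, count) summaries (assumed smoother)."""
    bins = defaultdict(list)
    for v in values:
        bins[int(v // bin_width)].append(v)
    return [(fmean(vs), len(vs)) for _, vs in sorted(bins.items())]


def weighted_mean(group):
    """Mean of the summaries in a group, weighted by their counts."""
    return sum(m * w for m, w in group) / sum(w for _, w in group)


def split_in_two(summaries, iters=50):
    """Weighted 2-means on the summaries; a stand-in for the mixture fit."""
    lo = min(m for m, _ in summaries)
    hi = max(m for m, _ in summaries)
    left, right = list(summaries), []
    for _ in range(iters):
        left = [s for s in summaries if abs(s[0] - lo) <= abs(s[0] - hi)]
        right = [s for s in summaries if abs(s[0] - lo) > abs(s[0] - hi)]
        if not left or not right:
            break
        new_lo, new_hi = weighted_mean(left), weighted_mean(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments have stopped changing
        lo, hi = new_lo, new_hi
    return left, right


def tree_cluster(summaries, depth):
    """Recursively split clusters; return the leaves of the resulting tree."""
    if depth == 0 or len(summaries) < 2:
        return [summaries]
    left, right = split_in_two(summaries)
    if not left or not right:
        return [summaries]
    return tree_cluster(left, depth - 1) + tree_cluster(right, depth - 1)


# Hypothetical pixel intensities; real Landsat data would be multi-band.
values = [10, 12, 14, 11, 120, 125, 122, 240, 243, 238, 241]
summaries = pre_smooth(values, bin_width=8)
leaves = tree_cluster(summaries, depth=1)
for leaf in leaves:
    print(round(weighted_mean(leaf), 2), sum(w for _, w in leaf))
```

The clustering step only ever sees the bin summaries, so its cost scales with the number of bins rather than the number of pixels. That is the source of the speed-up the abstract reports, at the price of whatever detail the smoother discards.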
