Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures

This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies that use high-throughput technologies, generate many very large datasets, and require increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms and software design can lead to vast speed-ups and, critically, enable statistical analyses that presently are not performed because of compute-time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context.
