Mixture modeling, sparse covariance estimation, and parallel computing in Bayesian analysis

Mixture modeling of continuous data is an effective and popular method for density estimation and clustering. However, as data sets grow in both dimension and number of observations, many modeling and computational problems arise. In the Bayesian setting, computational methods for posterior inference become intractable as the number of observations and/or possible clusters grows. Furthermore, relabeling in sampling methods becomes increasingly difficult to address as the data grow. This thesis addresses these problems with computational and methodological solutions that draw on modern computational hardware and new methodology. Novel approaches for parsimonious covariance modeling and information sharing across multiple data sets are then built upon these computational improvements. Chapter 1 introduces the fundamental approaches in mixture modeling, including Dirichlet processes and posterior inference using Gibbs sampling. Chapter 2 describes the use of graphics processing units for massive gains in computational performance in both mixture models and general Bayesian modeling. Chapter 3 introduces a new relabeling approach in mixture modeling that scales far beyond current methodology to massive data and high-dimensional settings. Chapter 4 extends the methods of Chapters 2 and 3 to the hierarchical Dirichlet process setting to "borrow strength" across multiple studies in classification problems in flow cytometry. Chapter 5 develops a novel approach for sparse covariance estimation using sparse, full-rank, orthogonal matrix estimation. These new methods are applied to classification in a mixture modeling setting with measurement error. Finally, Chapter 6 summarizes the work presented in this thesis and outlines areas for future research.
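
As a small illustration of the Dirichlet process mixture setting named above, the sketch below draws mixture weights from a truncated stick-breaking prior. It is not taken from the thesis: the truncation level, the function name `stick_breaking_weights`, and the concentration value `alpha = 1.0` are assumptions made for the example, and only the Python standard library is used.

```python
import random


def stick_breaking_weights(alpha, num_components, rng):
    """Draw mixture weights from a truncated stick-breaking prior.

    Hypothetical helper for illustration only. Each weight is a
    Beta(1, alpha) fraction of the stick length that remains after
    the earlier components have taken their share.
    """
    if alpha <= 0 or num_components < 1:
        raise ValueError("alpha must be positive and num_components >= 1")
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for k in range(num_components):
        if k == num_components - 1:
            # Truncation: the last component takes whatever is left,
            # so the weights sum to one.
            fraction = 1.0
        else:
            fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights


# Assumed settings for the example: concentration 1.0, ten components.
rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, num_components=10, rng=rng)
```

Smaller values of `alpha` tend to concentrate mass on the first few components, while larger values spread it over more of them. In a Gibbs sampler these weights would be updated conditionally on the current component assignments rather than drawn from the prior as here.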
