Simplex Factor Models for Multivariate Unordered Categorical Data

Gaussian latent factor models are routinely used to model dependence in continuous, binary, and ordered categorical data. For unordered categorical variables, Gaussian latent factor models lead to challenging computation and complex modeling structures. As an alternative, we propose a novel class of simplex factor models. In the single-factor case, the model treats the different categorical outcomes as independent with unknown marginals. The model can characterize flexible dependence structures parsimoniously with few factors, and as factors are added, any multivariate categorical data distribution can be accurately approximated. Taking a Bayesian approach to computation and inference, we propose a Markov chain Monte Carlo (MCMC) algorithm that scales well with increasing dimension and treats the number of factors as unknown. We develop an efficient proposal for updating the base probability vector in hierarchical Dirichlet models. Theoretical properties are described, and we evaluate the approach through simulation examples. Applications are described for modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features.
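To make the structure concrete, the following is a minimal simulation sketch of one mixed-membership reading of a simplex factor model. It is an illustration under stated assumptions, not the specification used in the paper: each subject is assumed to carry a factor vector on the probability simplex with a Dirichlet prior, and each variable's category probabilities are assumed to be the corresponding convex combination of factor-specific probability vectors. The function name `simulate_simplex_factor`, the symmetric Dirichlet priors, and the parameter `alpha` are all hypothetical choices made for this example.

```python
import numpy as np

def simulate_simplex_factor(n, p, d, k, alpha=1.0, seed=0):
    """Simulate n subjects, p categorical variables with d levels, k factors.

    Assumed form (hypothetical, for illustration only):
      eta[i]     ~ Dirichlet(alpha, ..., alpha)   factor vector on the simplex
      lam[j, h]  ~ Dirichlet(1, ..., 1)           category probabilities
      P(y[i, j] = c | eta[i]) = sum_h eta[i, h] * lam[j, h, c]
    """
    rng = np.random.default_rng(seed)
    lam = rng.dirichlet(np.ones(d), size=(p, k))    # shape (p, k, d)
    eta = rng.dirichlet(np.full(k, alpha), size=n)  # shape (n, k)

    # Convex combination of factor-specific probability vectors.
    probs = np.einsum("ih,jhc->ijc", eta, lam)      # shape (n, p, d)

    # Inverse-CDF draw of one category per (subject, variable) pair.
    u = rng.random((n, p, 1))
    y = (u > np.cumsum(probs, axis=-1)).sum(axis=-1)
    y = np.minimum(y, d - 1)  # guard against round-off in the cumulative sum
    return y, eta, lam, probs

# Four-level outcomes, as in nucleotide sequences (A, C, G, T).
y, eta, lam, probs = simulate_simplex_factor(n=200, p=10, d=4, k=3)
```

With `k=1` every factor vector equals 1, so all subjects share the same marginal probabilities for each variable and the outcomes are independent, which matches the single-factor case described above. Larger `k` induces dependence across variables through the shared factor vector.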
