CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data

There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.
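The nested structure described above (a Dirichlet process mixture over columns, each component of which is its own Dirichlet process mixture over rows) can be illustrated by sampling the latent partitions from the prior. The following is a minimal sketch, not the paper's implementation: it uses the Chinese restaurant process representation of the Dirichlet process, and the function names `crp_partition` and `sample_crosscat_structure`, the concentration parameters `alpha_cols` and `alpha_rows`, and the `seed` argument are all hypothetical. It samples only the view and category assignments; the per-category parametric models and the Gibbs inference scheme are omitted.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Sample a partition of n_items from a Chinese restaurant process.

    Each item joins an existing block with probability proportional to the
    block's size, or opens a new block with probability proportional to alpha.
    """
    assignments = []
    counts = []  # counts[k] = number of items currently in block k
    for _ in range(n_items):
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new block
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_crosscat_structure(n_rows, n_cols, alpha_cols=1.0, alpha_rows=1.0, seed=0):
    """Sample a CrossCat-style latent structure from the prior (assumed form).

    Returns (col_to_view, row_to_category), where col_to_view[j] is the view
    of column j and row_to_category[v][i] is the category of row i in view v.
    """
    rng = random.Random(seed)
    # Outer partition: columns are grouped into non-overlapping views.
    col_to_view = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(col_to_view) + 1
    # Inner partitions: each view clusters the rows independently.
    row_to_category = [crp_partition(n_rows, alpha_rows, rng) for _ in range(n_views)]
    return col_to_view, row_to_category


if __name__ == "__main__":
    views, categories = sample_crosscat_structure(n_rows=10, n_cols=6, seed=1)
    print("column -> view:", views)
    for v, cats in enumerate(categories):
        print(f"view {v}: row -> category:", cats)
```

Because each view carries its own row partition, two columns in different views can cluster the same rows in entirely different ways, which is how the model can represent multiple independent statistical signals in one table.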
