FastEx: Hash Clustering with Exponential Families

Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution that uses random projections to generate proposals at high throughput, which is crucial for clustering models with large numbers of clusters.
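To make the idea of a projection-based proposal concrete, the following is a minimal sketch and not the paper's algorithm. It assumes sign random projections: each vector is reduced to a short bit string, and the fraction of disagreeing bits between two strings estimates the angle between the original vectors. All names (`make_projections`, `sign_hash`, `estimate_cosine`, `propose_cluster`) and the `temperature` parameter are hypothetical, and the weighting used in the proposal is an illustrative choice. In a full sampler such a proposal would normally be paired with an accept/reject correction, which is omitted here.

```python
import math
import random


def make_projections(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def sign_hash(vec, projections):
    """One bit per projection: 1 if the dot product is non-negative, else 0."""
    return [
        1 if sum(p * v for p, v in zip(row, vec)) >= 0 else 0
        for row in projections
    ]


def estimate_cosine(hash_a, hash_b):
    """Estimate cos(angle) from the fraction of disagreeing bits.

    For sign random projections, P(bits differ) = angle / pi.
    """
    disagreements = sum(a != b for a, b in zip(hash_a, hash_b))
    return math.cos(math.pi * disagreements / len(hash_a))


def propose_cluster(x_hash, cluster_hashes, rng, temperature=1.0):
    """Sample a cluster index with weight exp(estimated cosine / temperature).

    Assumption for illustration only: similarity-based weights stand in for
    the cluster likelihoods a real sampler would approximate.
    """
    weights = [
        math.exp(estimate_cosine(x_hash, h) / temperature)
        for h in cluster_hashes
    ]
    return rng.choices(range(len(cluster_hashes)), weights=weights, k=1)[0]


if __name__ == "__main__":
    projections = make_projections(dim=3, n_bits=256, seed=0)
    clusters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cluster_hashes = [sign_hash(c, projections) for c in clusters]
    x_hash = sign_hash([0.9, 0.1, 0.0], projections)
    print(propose_cluster(x_hash, cluster_hashes, random.Random(1)))
```

The point of the design is cost: once the bit strings are stored, comparing a data point against a cluster is a bit-count rather than a full inner product, which matters when the number of clusters is large.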
