Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection

Estimating the number of components (the order) of a mixture model is often addressed using criteria such as the Bayesian information criterion (BIC) and minimum message length (MML). However, when the feature space is very large, these criteria may grossly underestimate the order. Here, it is suggested that this failure is not mainly attributable to the criterion (e.g., BIC), but rather to the lack of "structure" in standard mixtures, which trade off data fit against model complexity only by varying the order. The present paper proposes mixtures with a richer set of tradeoffs: each component is allowed its own informative feature subset, with all remaining features explained by a common model shared across components. This parameter sharing greatly reduces model complexity at a given order. Since the space of such parsimonious modeling solutions is vast, it is searched efficiently by integrating component and feature selection within generalized expectation-maximization (GEM) learning of the mixture parameters. The quality of the resulting (unsupervised) solutions is evaluated using both classification error and test-set data likelihood. On text data, the proposed multinomial version, learned without labeled examples, without knowledge of the "true" number of topics, and without feature preprocessing, compares quite favorably with alternative unsupervised methods and with a supervised naive Bayes classifier. A Gaussian version compares favorably with a recent method that introduces "feature saliency" in mixtures.
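The core modeling idea, component-specific parameters for selected features backed by a single shared model for the rest, can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian form, the fixed binary masks, and all variable names below are assumptions made purely for illustration.

```python
# Minimal sketch of a "parsimonious" mixture log-likelihood (illustrative only).
# Each component j has a binary feature mask v[j]: masked-in features use
# component-specific Gaussian parameters, all other features fall back to one
# shared Gaussian model common to every component.
import numpy as np
from scipy.stats import norm

def parsimonious_mixture_loglik(X, weights, mu, sigma, mu_shared, sigma_shared, mask):
    """X: (n, d) data; weights: (k,) mixing proportions;
    mu, sigma: (k, d) component-specific Gaussian parameters;
    mu_shared, sigma_shared: (d,) shared-model parameters;
    mask: (k, d) binary feature-selection indicators."""
    n, _ = X.shape
    k = len(weights)
    log_px = np.zeros((n, k))
    for j in range(k):
        comp = norm.logpdf(X, mu[j], sigma[j])              # component-specific densities
        shared = norm.logpdf(X, mu_shared, sigma_shared)    # shared-model densities
        per_feature = mask[j] * comp + (1.0 - mask[j]) * shared
        log_px[:, j] = np.log(weights[j]) + per_feature.sum(axis=1)
    # log-sum-exp over components, summed over the data set
    m = log_px.max(axis=1, keepdims=True)
    return (m.squeeze(1) + np.log(np.exp(log_px - m).sum(axis=1))).sum()
```

In the paper's approach the masks themselves are learned jointly with the order and the parameters inside GEM; the sketch above only shows why sharing the unselected features' parameters keeps the per-order complexity low.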
