Clustering by Similarity in an Auxiliary Space

We present a clustering method for continuous data. It defines local clusters in the (primary) data space but derives its similarity measure from the posterior distributions of discrete auxiliary data that co-occur in pairs with the primary data. In one case study, enterprises are clustered with a similarity measure derived from bankruptcy sensitivity; in another, a content-based clustering of text documents is obtained by measuring differences between their metadata (keyword distributions). We show that minimizing our Kullback-Leibler divergence-based distortion measure within the categories is equivalent to maximizing the mutual information between the categories and the distributions in the auxiliary space. A simple on-line algorithm that minimizes the distortion is introduced for Gaussian basis functions and for their analogs on a hypersphere.
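The on-line scheme described in the abstract can be sketched as stochastic gradient descent on a distortion of the form E = E_x[ Σ_j y_j(x) D(p(c|x) ‖ ψ_j) ], where y_j(x) are normalized Gaussian basis functions in the primary space and ψ_j are prototype distributions in the auxiliary space. The sketch below is a minimal illustration under these assumptions, not the paper's exact update rule; the function names, the softmax parameterization of ψ_j, and the hyperparameters (sigma, lr, epochs) are illustrative choices.

```python
import numpy as np

def soft_assign(x, centers, sigma):
    """Normalized Gaussian basis functions y_j(x) in the primary data space.
    Assumes x stays close enough to some center that the weights do not all underflow."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum()

def kl(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence D(p || q), with 0*log(0) treated as 0."""
    return np.sum(np.where(p > 0, p * np.log((p + eps) / (q + eps)), 0.0))

def online_kl_clustering(X, P, k, sigma=0.5, lr=0.05, epochs=50, seed=0):
    """Stochastic-gradient minimization of E = E_x[ sum_j y_j(x) D(p(c|x) || psi_j) ].

    X: (n, d) primary continuous data.
    P: (n, c) auxiliary posteriors p(c|x), one distribution per sample.
    Returns cluster centers m_j (k, d) and prototype distributions psi_j (k, c).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    gamma = np.zeros((k, P.shape[1]))  # softmax parameters: psi_j stays a distribution
    for _ in range(epochs):
        for i in rng.permutation(n):
            x, p = X[i], P[i]
            y = soft_assign(x, centers, sigma)
            psi = np.exp(gamma - gamma.max(axis=1, keepdims=True))
            psi /= psi.sum(axis=1, keepdims=True)
            D = np.array([kl(p, psi[j]) for j in range(k)])
            Ebar = y @ D  # distortion of x averaged over the soft assignment
            # dE/dm_j = y_j (D_j - Ebar)(x - m_j) / sigma^2:
            # a center whose prototype fits p better than average moves toward x
            centers -= lr * (y * (D - Ebar))[:, None] * (x - centers) / sigma ** 2
            # dE/dgamma_j = y_j (psi_j - p): prototypes drift toward the
            # assignment-weighted mean of the auxiliary posteriors
            gamma -= lr * y[:, None] * (psi - p)
    psi = np.exp(gamma - gamma.max(axis=1, keepdims=True))
    psi /= psi.sum(axis=1, keepdims=True)
    return centers, psi
```

Parameterizing ψ_j through a softmax keeps each prototype a valid distribution without explicit projection steps; the hyperspherical variant mentioned in the abstract would replace the Gaussian basis functions with their analogs on the unit sphere (von Mises-Fisher-type kernels).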
