Clustering Based on Conditional Distributions in an Auxiliary Space

We study the problem of learning groups or categories that are local in the continuous primary space but homogeneous with respect to the distributions of an associated auxiliary random variable over a discrete auxiliary space. Assuming that variation in the auxiliary space is meaningful, the learned categories will emphasize similarly meaningful aspects of the primary space. From a data set consisting of pairs of primary and auxiliary items, the categories are learned by minimizing a Kullback-Leibler divergence-based distortion between (implicitly estimated) distributions of the auxiliary data, conditioned on the primary data. Still, the categories are defined in terms of the primary space. An online algorithm resembling traditional Hebb-type competitive learning is introduced for learning the categories. Minimizing the distortion criterion turns out to be equivalent to maximizing the mutual information between the categories and the auxiliary data. In addition, connections to density estimation and to the distributional clustering paradigm are outlined. The method is demonstrated by clustering yeast gene expression data from DNA chips, with biological knowledge about the functional classes of the genes as the auxiliary data.
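The core idea can be illustrated with a small simulation. The sketch below is a minimal, hard-assignment caricature under our own simplifying assumptions (the paper works with soft, smooth cluster memberships and implicitly estimated conditional densities): each cluster keeps a prototype in the primary space and a running multinomial estimate of the auxiliary distribution, and the mutual information between the induced partition and the auxiliary labels can be checked against the KL-based distortion. All function and variable names here are illustrative, not the paper's notation.

```python
import numpy as np

def auxiliary_clustering(X, C, n_clusters=4, n_aux=3,
                         lr=0.05, n_epochs=20, seed=0):
    """Hard-assignment caricature of the online algorithm.

    X : (n, d) array of primary-space samples
    C : (n,)   array of discrete auxiliary labels in {0, ..., n_aux-1}

    Each cluster j keeps a prototype m_j (primary space) and a running
    multinomial estimate psi_j of p(c | cluster j). For every pair (x, c)
    the nearest prototype wins, moves toward x (Hebb-like competitive
    update), and its auxiliary distribution is nudged toward the label.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    M = X[rng.choice(n, size=n_clusters, replace=False)].copy()
    Psi = np.full((n_clusters, n_aux), 1.0 / n_aux)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            x, c = X[i], C[i]
            j = np.argmin(((M - x) ** 2).sum(axis=1))   # winner in primary space
            M[j] += lr * (x - M[j])                     # stay local in primary space
            Psi[j] += lr * (np.eye(n_aux)[c] - Psi[j])  # track p(c | cluster j)
    return M, Psi

def cluster_label_mi(X, C, M, n_aux, eps=1e-12):
    """Empirical mutual information I(cluster; auxiliary label), the
    quantity that minimizing the KL-based distortion implicitly maximizes."""
    winners = np.argmin(((X[:, None, :] - M[None, :, :]) ** 2).sum(-1), axis=1)
    joint = np.zeros((len(M), n_aux))
    for j, c in zip(winners, C):
        joint[j, c] += 1.0
    joint /= joint.sum()
    pj = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * np.log((joint + eps) / (pj @ pc + eps))))
```

On toy data where the auxiliary label distribution varies across the primary space, `cluster_label_mi` should grow over training relative to a plain k-means partition of X; this serves only as a sanity check of the distortion-mutual-information equivalence, not as a reproduction of the paper's soft-assignment estimator.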