Learning Metrics and Discriminative Clustering

In this work, methods have been developed to extract relevant information from large, multivariate data sets in a flexible, nonlinear way. The techniques are applicable especially at the initial, explorative phase of data analysis, in cases where an explicit indicator of relevance is available as part of the data set. Unsupervised learning methods, popular in data exploration, often rely on a distance measure defined for data items. Selection of the distance measure, part of which is feature selection, is therefore fundamentally important. The learning metrics principle is introduced to complement manual feature selection by enabling automatic modification of a distance measure on the basis of available relevance information. Two applications of the principle are developed. The first emphasizes relevant aspects of the data by directly modifying distances between data items, and is usable, for example, in information visualization with self-organizing maps. The other method, discriminative clustering, finds clusters that are internally homogeneous with respect to the interesting variation of the data. The techniques have been applied to text document analysis, gene expression clustering, and charting the bankruptcy sensitivity of companies. In the first, more straightforward approach, a new local metric of the data space measures changes in the conditional distribution of the relevance-indicating data by the Fisher information matrix, a local approximation of the Kullback-Leibler distance. Discriminative clustering, on the other hand, directly minimizes a Kullback-Leibler-based distortion measure within the clusters, or equivalently maximizes the mutual information between the clusters and the relevance indicator. A finite-data algorithm for discriminative clustering is also presented. It maximizes a partially marginalized posterior probability of the model and is asymptotically equivalent to maximizing mutual information.

© All rights reserved. No part of the publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the author.
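
The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes, purely for illustration, that the conditional distribution p(c | x) of the relevance indicator c is given by a softmax-linear model with hypothetical parameters `W` and `b`, and all function names are invented here. Under that assumption the Fisher information matrix with respect to x has the closed form W^T (diag(p) - p p^T) W, and half the quadratic form dx^T J(x) dx should approximate the Kullback-Leibler divergence between p(c | x) and p(c | x + dx) for small dx. The last function computes the mutual information between clusters and the relevance indicator from a contingency table of counts.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

def class_posterior(x, W, b):
    """Hypothetical estimate of p(c | x): a softmax-linear model (assumption)."""
    return softmax(W @ x + b)

def fisher_matrix(x, W, b):
    """J(x) = E_{p(c|x)}[grad_x log p(c|x) grad_x log p(c|x)^T].
    For the assumed softmax-linear model this equals W^T (diag(p) - p p^T) W."""
    p = class_posterior(x, W, b)
    return W.T @ (np.diag(p) - np.outer(p, p)) @ W

def kl(p, q):
    """Kullback-Leibler divergence between two strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def mutual_information(counts):
    """I(cluster; class) in nats from a cluster-by-class contingency table."""
    joint = np.asarray(counts, dtype=float)
    joint = joint / joint.sum()
    pk = joint.sum(axis=1, keepdims=True)   # cluster marginal, shape (K, 1)
    pc = joint.sum(axis=0, keepdims=True)   # class marginal, shape (1, C)
    nz = joint > 0                          # skip empty cells (0 log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pk @ pc)[nz])))

# Illustrative parameters only; nothing here comes from the original data sets.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
x = np.array([0.5, -1.0])
dx = 1e-3 * np.array([1.0, 2.0])

local_d2 = float(dx @ fisher_matrix(x, W, b) @ dx)
kl_exact = kl(class_posterior(x, W, b), class_posterior(x + dx, W, b))
print("0.5 * dx^T J dx:", 0.5 * local_d2)
print("KL divergence:  ", kl_exact)
```

The quadratic form depends only on how the class distribution changes along dx, so directions of the data space that leave p(c | x) unchanged contribute nothing to the distance. In the same spirit, a clustering whose contingency table with the relevance indicator has higher mutual information is more homogeneous with respect to that indicator.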
