Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization

This paper develops, from first information-geometric principles, a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function, known as the Fisher kernel, is derived. The approach applies to unsupervised and supervised learning problems alike, including the interesting case where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.
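
As a rough illustration of how a Fisher kernel turns a generative document model into a similarity function, the sketch below uses a deliberately simplified model: a single multinomial over terms, not the latent class model described in the abstract. It assumes the square-root parameterization rho_j = 2*sqrt(theta_j), treats the Fisher information as the identity, and ignores the normalization constraint (which only shifts the kernel by a constant). The function names, the toy documents, and the pooled estimate of theta are all hypothetical choices made for this example.

```python
from collections import Counter


def term_distribution(tokens):
    """Empirical term distribution of one document (memoryless source)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def fisher_kernel(p_a, p_b, theta):
    """Fisher kernel for a single multinomial model with parameters theta.

    Assumption: with rho_j = 2*sqrt(theta_j) the score of a document is
    p_hat_j / sqrt(theta_j) and the Fisher information is taken to be the
    identity, so the kernel reduces to sum_j p_a[j] * p_b[j] / theta[j].
    """
    shared_terms = set(p_a) & set(p_b)
    return sum(p_a[t] * p_b[t] / theta[t] for t in shared_terms)


# Toy corpus (hypothetical data, for illustration only).
doc_a = "fisher kernel text retrieval".split()
doc_b = "text retrieval with latent classes".split()
doc_c = "matrix factorization of images".split()

# Estimate theta from the pooled corpus, so every observed term has theta > 0.
theta = term_distribution(doc_a + doc_b + doc_c)

p_a, p_b, p_c = (term_distribution(d) for d in (doc_a, doc_b, doc_c))

k_ab = fisher_kernel(p_a, p_b, theta)  # documents sharing two terms
k_ac = fisher_kernel(p_a, p_c, theta)  # documents sharing no terms
print(k_ab, k_ac)
```

Under these assumptions the kernel is an inner product of term distributions in which each term is weighted by the inverse of its model probability, so rare shared terms contribute more than frequent ones. A latent class model would add further terms to the kernel that this sketch does not attempt to reproduce.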
