Learning Nonstructural Distance Metric by Minimum Cluster Distortion

Much of natural language processing still relies on the Euclidean (cosine) distance between two feature vectors, but this distance handles feature weightings and feature correlations poorly. To address both problems at once, we propose an optimal distance metric that can be used in place of the cosine distance. The metric is optimal in the sense of global quadratic minimization, and it is obtained from the clusters in the training data in a supervised fashion. We evaluated the proposed metric on a synonymous sentence retrieval task, a document retrieval task, and K-means clustering of general vectorial data. The results showed consistent improvement over the Euclidean and tf.idf baselines; the gain was largest for the sentence retrieval task, with a 33% increase in 11-point average precision.
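The abstract does not state the closed form of the metric, so the following is only a minimal sketch under an explicit assumption: the metric matrix is taken to be the Moore-Penrose pseudo-inverse of the summed within-cluster scatter, a Mahalanobis-style construction that downweights directions in which labeled clusters are spread out. The function names `learn_metric` and `metric_distance` are hypothetical, and any normalization the paper may apply is omitted.

```python
import numpy as np


def learn_metric(X, labels):
    """Estimate a metric matrix M from labeled clusters.

    Assumption (not taken from the abstract): M is the pseudo-inverse of
    the within-cluster scatter summed over all clusters.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dim = X.shape[1]
    scatter = np.zeros((dim, dim))
    for c in np.unique(labels):
        members = X[labels == c]
        # Deviations of each member from its cluster centroid.
        diffs = members - members.mean(axis=0)
        scatter += diffs.T @ diffs
    # The pseudo-inverse tolerates a singular scatter matrix.
    return np.linalg.pinv(scatter)


def metric_distance(u, v, M):
    """Quadratic-form distance sqrt((u - v)^T M (u - v))."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    # Clamp tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(diff @ M @ diff, 0.0)))
```

With `M` set to the identity matrix, `metric_distance` reduces to the ordinary Euclidean distance, so the learned matrix can be read as a drop-in replacement that reweights and decorrelates features.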
