Discriminative Training of Clustering Functions: Theory and Experiments with Entity Identification

Clustering is an optimization procedure that partitions a set of elements so as to optimize some criterion, based on a fixed distance metric defined between the elements. Clustering approaches have been widely applied in natural language processing, and it has been shown repeatedly that their success depends on defining a good distance metric, one that is appropriate for both the task and the clustering algorithm used. This paper develops a framework in which clustering is viewed as a learning task, and proposes a way to train a distance metric that is appropriate for the chosen clustering algorithm in the context of the given task. Experiments on the entity identification problem show significant performance improvements over state-of-the-art clustering approaches developed for this problem.
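
The dependence of a clustering on its distance metric can be illustrated with a small sketch. This is not the training procedure proposed in the paper: the weighted distance, the threshold-based single-link clustering, and the names `weighted_distance` and `threshold_clustering` are all hypothetical choices made for illustration. It only shows that the same algorithm, run on the same elements, yields different partitions under different metric parameters, which is why those parameters are worth learning for a given task.

```python
from itertools import combinations


def weighted_distance(x, y, weights):
    """Hypothetical parameterized metric: weighted sum of absolute feature differences."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


def threshold_clustering(points, weights, threshold):
    """Illustrative single-link clustering: link any pair closer than `threshold`."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if weighted_distance(points[i], points[j], weights) < threshold:
            parent[find(i)] = find(j)

    # Group element indices by their representative.
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Four toy elements with two features each (made-up values).
points = [(0.0, 0.0), (0.1, 5.0), (5.0, 0.1), (5.1, 5.0)]

# Same algorithm and data, two different metric parameterizations.
by_first = threshold_clustering(points, weights=(1.0, 0.0), threshold=1.0)
by_second = threshold_clustering(points, weights=(0.0, 1.0), threshold=1.0)

print(by_first)   # partition when only the first feature counts
print(by_second)  # partition when only the second feature counts
```

With weights that emphasize the first feature, elements 0 and 1 fall together; with weights that emphasize the second, elements 0 and 2 do. Which partition is correct depends on the task, so a metric fixed in advance can be a poor fit for the clustering algorithm that uses it.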
