Efficient clustering of high-dimensional data sets with application to reference matching

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once: for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.
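The two-stage idea described above can be illustrated with a minimal sketch. The abstract does not specify how canopies are built, so everything here is an assumption made for illustration: the function names `make_canopies` and `candidate_pairs`, the two thresholds `t_loose` and `t_tight`, and the toy absolute-difference distance. It is one plausible way to form overlapping subsets with a cheap measure, not necessarily the authors' exact procedure.

```python
from itertools import combinations


def make_canopies(points, cheap_distance, t_loose, t_tight):
    """Group point indices into overlapping canopies using a cheap distance.

    Assumption (not from the abstract): a loose threshold decides canopy
    membership and a tighter one decides which points may still start a
    new canopy. Requires t_loose >= t_tight.
    """
    if t_loose < t_tight:
        raise ValueError("t_loose must be >= t_tight")
    remaining = list(range(len(points)))  # indices that may still become centers
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy,
        # including points that already belong to earlier canopies.
        members = [i for i in range(len(points))
                   if cheap_distance(points[center], points[i]) <= t_loose]
        canopies.append(members)
        # The center and points within the tight threshold are retired
        # as candidate centers.
        remaining = [i for i in remaining
                     if i != center
                     and cheap_distance(points[center], points[i]) > t_tight]
    return canopies


def candidate_pairs(canopies):
    """Return the index pairs that share at least one canopy.

    Only these pairs would need an exact (expensive) distance computation.
    """
    pairs = set()
    for members in canopies:
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Toy 1-D data with absolute difference standing in for the cheap measure.
points = [0.0, 1.0, 2.5, 10.0, 10.5]
canopies = make_canopies(points, lambda a, b: abs(a - b), t_loose=2.0, t_tight=1.0)
pairs = candidate_pairs(canopies)
print(canopies)
print(sorted(pairs))
```

On this toy input the point at index 1 lands in two canopies, which is the overlap the abstract refers to, and only 3 of the 10 possible pairs remain as candidates for exact distance measurement. Any downstream clusterer (agglomerative, K-means, EM) would then restrict its exact comparisons to those pairs.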
