Conceptual Clustering of Heterogeneous Distributed Databases

As more and more databases become available on the Internet, there is a growing opportunity to globalise knowledge discovery and learn general patterns, rather than restricting learning to specific databases whose rules may not generalise. Clustering distributed databases facilitates the learning of new concepts that characterise the common features of, and differences between, datasets. We are concerned here with clustering databases that hold aggregate count data on a set of attributes that have been classified according to heterogeneous classification schemes. Such aggregates are commonly used to summarise very large databases such as those encountered in data warehousing, large-scale transaction management, and statistical databases. To measure the difference between aggregates we use two distance measures: the Euclidean distance and the Kullback-Leibler information divergence. A hybrid of the two, which uses Kullback-Leibler to learn the class probabilities and the Euclidean distance as the corresponding distance measure, looks particularly promising in terms of both accuracy and scalability. These measures are evaluated using synthetic data. Important applications of the work include the clustering of heterogeneous customer databases for the discovery of new marketing concepts and the clustering of medical databases for the discovery of new epidemiological concepts.
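As a rough illustration of the two measures named above, the following Python sketch compares two aggregate count vectors. It assumes, for simplicity, that both databases already report counts over the same categories; the heterogeneous classification schemes addressed in the paper are not handled here. The function names (`to_probabilities`, `euclidean_distance`, `kl_divergence`) and the example counts are hypothetical and are not taken from the paper.

```python
import math


def to_probabilities(counts):
    """Normalise raw aggregate counts into class probabilities."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def euclidean_distance(p, q):
    """Euclidean distance between two probability vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i).

    Terms with p_i == 0 contribute nothing; if q_i == 0 while p_i > 0
    the divergence is infinite.
    """
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total


# Hypothetical aggregate counts from two databases over the same three classes.
p = to_probabilities([30, 50, 20])
q = to_probabilities([25, 25, 50])

print("Euclidean distance:", euclidean_distance(p, q))
print("KL divergence D(p||q):", kl_divergence(p, q))
print("KL divergence D(q||p):", kl_divergence(q, p))
```

Note that the Kullback-Leibler divergence is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, whereas the Euclidean distance is the same in both directions.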
