Clustering Distributed Homogeneous Datasets

In this paper we present an elegant and effective algorithm for measuring the similarity between homogeneous datasets to enable clustering. Once similar datasets are clustered, each cluster can be independently mined to generate the appropriate rules for a given cluster. The algorithm presented is efficient in storage and scale, has the ability to adjust to time constraints, and can provide the user with likely causes of similarity or dis-similarity. The proposed similarity measure is evaluated and validated on real datasets from the Census Bureau, Reuters, and synthetic datasets from IBM.

[1]  Heikki Mannila,et al.  Verkamo: Fast Discovery of Association Rules , 1996, KDD 1996.

[2]  Gregory Piatetsky-Shapiro,et al.  The KDD process for extracting useful knowledge from volumes of data , 1996, CACM.

[3]  Rakesh Agarwal,et al.  Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.

[4]  Alberto O. Mendelzon,et al.  Similarity-based queries , 1995, PODS '95.

[5]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[6]  Ramesh Subramonian Defining diff as a Data Mining Primitive , 1998, KDD.

[7]  Srinivasan Parthasarathy,et al.  New Algorithms for Fast Discovery of Association Rules , 1997, KDD.

[8]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[9]  Srinivasan Parthasarathy,et al.  Evaluation of sampling for data mining of association rules , 1997, Proceedings Seventh International Workshop on Research Issues in Data Engineering. High Performance Database Management for Large-Scale Applications.

[10]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, Advances in Knowledge Discovery and Data Mining.

[11]  Roy Goldman,et al.  Proximity Search in Databases , 1998, VLDB.

[12]  Heikki Mannila,et al.  Similarity of Attributes by External Probes , 1998, KDD.

[13]  L. Devroye A Course in Density Estimation , 1987 .

[14]  Philip S. Yu,et al.  Online generation of association rules , 1998, Proceedings 14th International Conference on Data Engineering.