Scalable clustering on the data grid

With the development of e-Science service-oriented infrastructures based on the Grid and service computing, many data grids are being created and are becoming a new potential source of information for scientists and data analysts. However, mining distributed data sets remains a challenge. Under the assumption that data may not be easily moved from one site to another (for performance, confidentiality or security reasons), we present a framework for distributed clustering where the data set is partitioned across several distant sites and the output is a mixture of Gaussian models. The data providers generate a clustering model using classic clustering techniques (currently K-Means and EM) and return it to a central site using a standard PMML representation of the model. The central site uses these models as starting observations for EM iterations to estimate the final model parameters. Although EM is known to be slower than simpler techniques, it is applied here to a relatively small set of observations and provides a probabilistic framework for the model combination. An initial version of the framework has been implemented and deployed on the Discovery Net infrastructure, which provides support for data and resource management and for workflow composition. We present empirical results that demonstrate the advantages of this approach and show that the final model remains accurate while allowing the mining of very large distributed data sets.
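
The two-stage scheme described above can be illustrated with a minimal sketch. The Python code below is only an illustration under our own assumptions: scikit-learn stands in for the actual Discovery Net components, plain Python objects replace the PMML exchange format, and the local cluster weights are omitted from the central EM step; the names fit_local_model and combine_local_models are hypothetical, not part of the framework.

```python
# Minimal sketch of the two-stage distributed clustering scheme.
# Hypothetical helper names; scikit-learn used in place of Discovery Net / PMML.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def fit_local_model(X, k):
    """Run at each data provider: cluster the local data and return only a
    model summary (centroids and their weights), never the raw records."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    counts = np.bincount(km.labels_, minlength=k)
    weights = counts / counts.sum()
    # In the framework this summary would travel to the central site as PMML.
    return km.cluster_centers_, weights

def combine_local_models(local_models, n_global_components):
    """Run at the central site: treat the local centroids as a small set of
    observations and estimate the final Gaussian mixture with EM.
    (Weighting the observations by local cluster size is omitted here.)"""
    centroids = np.vstack([centers for centers, _ in local_models])
    gmm = GaussianMixture(n_components=n_global_components,
                          covariance_type="full", random_state=0)
    gmm.fit(centroids)  # EM on a small observation set, not on the raw data
    return gmm

# Toy usage: three sites, each holding one partition of the data.
rng = np.random.default_rng(0)
sites = [rng.normal(loc=m, scale=0.5, size=(500, 2)) for m in (0.0, 5.0, 10.0)]
local_models = [fit_local_model(X, k=3) for X in sites]
global_model = combine_local_models(local_models, n_global_components=3)
print(global_model.means_)
```

Because the central EM step operates on the handful of local model components rather than on the raw records, the combination stays cheap and no raw data ever leaves a provider's site, which is the point of the approach.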