Distributed clustering and local regression for knowledge discovery in multiple spatial databases

Many large-scale spatial data analysis problems involve investigating relationships in heterogeneous databases. In such situations, rather than making predictions uniformly across entire spatial data sets, our previous study used clustering to identify similar spatial regions and then constructed local regression models describing the relationship between data characteristics and the target value inside each cluster. That approach requires all the data to reside on a central machine, so it is not applicable when a large volume of spatial data is distributed across multiple sites. Here, a novel distributed method for learning from heterogeneous spatial databases is proposed. Similar regions in multiple databases are identified by independently applying a spatial clustering algorithm at each site, then transferring the convex hulls of the identified clusters among sites and integrating them. For each discovered region, local regression models are built and transferred among data sites. The proposed method is shown to be computationally efficient and fairly accurate when compared to an approach where all the data are available at a central location.
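
The pipeline outlined above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the study, and it rests on several assumptions: the local spatial clustering step is taken as already done (each site simply holds lists of clustered records), hulls from different sites are integrated with a simple rule chosen here for illustration (merge two hulls when a vertex of one lies inside the other), and the local regression model is a one-feature least-squares line. All names (convex_hull, inside, integrate, fit_line, site_a, site_b) are hypothetical.

```python
# Illustrative sketch only: toy data, assumed clustering, assumed merge rule.

def cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p is inside or on a counter-clockwise hull with >= 3 vertices."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def integrate(hulls_a, hulls_b):
    """Assumed integration rule: merge hulls that share at least one vertex region."""
    regions = []
    for ha in hulls_a:
        for hb in hulls_b:
            if any(inside(ha, v) for v in hb) or any(inside(hb, v) for v in ha):
                regions.append(convex_hull(ha + hb))
    return regions

def fit_line(xs, ys):
    """Ordinary least squares for target = slope * feature + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Each record is (x, y, feature, target). Each site holds one cluster here,
# standing in for the output of a local spatial clustering algorithm.
site_a = [[(0, 0, 1.0, 3.0), (2, 0, 2.0, 5.0), (2, 2, 3.0, 7.0), (0, 2, 4.0, 9.0)]]
site_b = [[(1, 1, 5.0, 11.0), (3, 1, 6.0, 13.0), (3, 3, 7.0, 15.0), (1, 3, 8.0, 17.0)]]

# Step 1: each site summarizes its clusters as convex hulls (only these are sent).
hulls_a = [convex_hull([(r[0], r[1]) for r in c]) for c in site_a]
hulls_b = [convex_hull([(r[0], r[1]) for r in c]) for c in site_b]

# Step 2: hulls are integrated into regions shared across sites.
regions = integrate(hulls_a, hulls_b)

# Step 3: site A fits a local model on its records inside the region and
# transfers only the coefficients to site B.
in_region_a = [r for c in site_a for r in c if inside(regions[0], (r[0], r[1]))]
slope, intercept = fit_line([r[2] for r in in_region_a], [r[3] for r in in_region_a])

# Step 4: site B applies the transferred model to its own records in the region.
in_region_b = [r for c in site_b for r in c if inside(regions[0], (r[0], r[1]))]
max_err = max(abs(slope * r[2] + intercept - r[3]) for r in in_region_b)
```

In this sketch only hull vertices and regression coefficients cross site boundaries, never raw records, which is the property that makes the distributed setting workable. The toy data were constructed so that both sites follow the same linear relationship; real heterogeneous data would not fit this cleanly.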