A Framework for Finding Distributed Data Mining Strategies That are Intermediate Between Centralized Strategies and In-Place Strategies

Distributed data mining is emerging as a fundamental computational problem. A common approach in distributed data mining is to build separate models at geographically distributed sites and then to combine the models. At the other extreme, all of the data can be moved to a central site and a single model built. With the commodity internet and large data sets, the former approach is the quickest but often the least accurate, while the latter approach is more accurate but generally quite expensive in terms of the time required. Of course, there are a variety of intermediate strategies in which some of the data is moved while the rest is left in place and analyzed locally, with the resulting models then moved and combined. These intermediate cases are becoming of practical significance with the explosion of fiber and the emergence of high performance networks. In this paper, we examine this intermediate case in the setting in which high performance networks are present and the cost function reflects both computational and communication costs. We reduce the problem to a convex programming problem so that standard techniques can be applied. We illustrate our approach through the analysis of an example that shows the complexity and richness of this class of problems.
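As a concrete illustration of the kind of formulation the abstract refers to, the sketch below poses a toy version of the data-placement decision as a convex program and solves it with an off-the-shelf solver. It is not the authors' model: the number of sites, the per-record communication and compute costs, and the quadratic accuracy surrogate are all illustrative assumptions, introduced only to show how "what fraction of each site's data to move" can be cast as a standard convex optimization.

```python
# A minimal sketch, assuming hypothetical cost parameters; not the paper's formulation.
import cvxpy as cp
import numpy as np

# Per-site problem data (hypothetical values, for illustration only).
n       = np.array([1e6, 5e5, 2e6, 8e5])   # records held at each site
comm    = np.array([2.0, 1.5, 3.0, 2.5])   # cost to ship one record to the central site
local   = np.array([1.2, 1.4, 1.1, 1.3])   # cost to process one record locally
central = 1.0                              # cost to process one record centrally
penalty = 2e6                              # weight on the assumed accuracy surrogate

x = cp.Variable(len(n))                    # fraction of each site's data moved centrally

# Communication cost for the shipped records plus compute cost for
# records processed centrally and records processed in place.
transfer = cp.sum(cp.multiply(comm * n, x))
compute  = cp.sum(cp.multiply(central * n, x)) + cp.sum(cp.multiply(local * n, 1 - x))
# Convex surrogate for the accuracy lost by combining purely local models
# (an assumption of this sketch, not a term taken from the paper).
accuracy = penalty * cp.sum_squares(1 - x)

problem = cp.Problem(cp.Minimize(transfer + compute + accuracy),
                     [x >= 0, x <= 1])
problem.solve()
print("fraction of each site's data to move:", np.round(x.value, 3))
```

Because the objective is a sum of affine terms and a convex quadratic over box constraints, any standard convex solver recovers the global optimum; this is the practical payoff of reducing the intermediate-strategy problem to convex programming.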