A Framework for Finding Distributed Data Mining Strategies That are Intermediate Between Centralized Strategies and In-Place Strategies

Distributed data mining is emerging as a fundamental computational problem. A common approach in distributed data mining is to build separate models at geographically distributed sites and then to combine the models. At the other extreme, all of the data can be moved to a central site and a single model built. With the commodity internet and large data sets, the former approach is the quickest but often the least accurate, while the latter approach is more accurate but generally quite expensive in terms of the time required. Of course, there are a variety of intermediate strategies in which some of the data is moved while the rest is left in place and analyzed locally, with the resulting models then moved and combined. These intermediate cases are becoming of practical significance with the explosion of fiber and the emergence of high performance networks. In this paper, we examine this intermediate case in the setting in which high performance networks are present and the cost function reflects both computational and communication costs. We reduce the problem to a convex programming problem so that standard techniques can be applied. We illustrate our approach through the analysis of an example that shows the complexity and richness of this class of problems.
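As a concrete illustration of the kind of formulation the abstract refers to, the sketch below poses a toy version of the data-placement decision as a convex program and solves it with an off-the-shelf solver. It is not the authors' model: the number of sites, the per-record communication and compute costs, and the quadratic accuracy surrogate are all illustrative assumptions, introduced only to show how "what fraction of each site's data to move" can be cast as a standard convex optimization.

```python
# A minimal sketch, assuming hypothetical cost parameters; not the paper's formulation.
import cvxpy as cp
import numpy as np

# Per-site problem data (hypothetical values, for illustration only).
n       = np.array([1e6, 5e5, 2e6, 8e5])   # records held at each site
comm    = np.array([2.0, 1.5, 3.0, 2.5])   # cost to ship one record to the central site
local   = np.array([1.2, 1.4, 1.1, 1.3])   # cost to process one record locally
central = 1.0                              # cost to process one record centrally
penalty = 2e6                              # weight on the assumed accuracy surrogate

x = cp.Variable(len(n))                    # fraction of each site's data moved centrally

# Communication cost for the shipped records plus compute cost for
# records processed centrally and records processed in place.
transfer = cp.sum(cp.multiply(comm * n, x))
compute  = cp.sum(cp.multiply(central * n, x)) + cp.sum(cp.multiply(local * n, 1 - x))
# Convex surrogate for the accuracy lost by combining purely local models
# (an assumption of this sketch, not a term taken from the paper).
accuracy = penalty * cp.sum_squares(1 - x)

problem = cp.Problem(cp.Minimize(transfer + compute + accuracy),
                     [x >= 0, x <= 1])
problem.solve()
print("fraction of each site's data to move:", np.round(x.value, 3))
```

Because the objective is a sum of affine terms and a convex quadratic over box constraints, any standard convex solver recovers the global optimum; this is the practical payoff of reducing the intermediate-strategy problem to convex programming.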