Efficient probability density balancing for supporting distributed knowledge discovery in large databases

This article addresses two questions about data received online from a source with an unknown probability distribution: how to partition the data efficiently into smaller representative subsets (databases), and how to organize these subsets so as to minimize the computational cost of later data analysis. The proposed linear-time, online problem decomposition method achieves these objectives by balancing the probability distributions of the individual disjoint data subsets, each of which is intended to approximate the original data-source distribution. Consequently, computationally efficient statistical data analysis and neural network modelling on data subsets that fit into a computer's main memory will produce results similar to those obtained through a global, computationally infeasible data analysis. In addition, the proposed decomposition scheme enables effective distributed data analysis on a network of workstations.
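The abstract does not spell out the balancing procedure, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes one-dimensional data, a fixed set of histogram bin edges chosen in advance, and a hypothetical helper named `balanced_partition`. Each arriving value is sent to the subset that currently holds the fewest values in that value's bin, so every subset's histogram stays close to the histogram of the whole stream. The work per item is proportional to the number of subsets, so the total cost is linear in the stream length.

```python
import random
from bisect import bisect_right


def balanced_partition(stream, edges, k):
    """Split a stream into k disjoint subsets with balanced histograms.

    Illustrative assumption: `edges` is a sorted list of bin boundaries
    fixed before the stream starts; values outside the range fall into
    the two open-ended outer bins.
    """
    n_bins = len(edges) + 1
    counts = [[0] * n_bins for _ in range(k)]  # counts[j][b]: subset j, bin b
    subsets = [[] for _ in range(k)]
    for x in stream:
        b = bisect_right(edges, x)  # index of the bin containing x
        # Pick the subset with the fewest items in this bin (lowest index on ties).
        target = min(range(k), key=lambda j: counts[j][b])
        counts[target][b] += 1
        subsets[target].append(x)
    return subsets, counts


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
edges = [-2.0, -1.0, 0.0, 1.0, 2.0]
subsets, counts = balanced_partition(data, edges, k=4)

# Per-bin counts of any two subsets differ by at most one item.
for b in range(len(edges) + 1):
    column = [counts[j][b] for j in range(4)]
    print(b, column)
```

Because ties go to the lowest-numbered subset, each bin is effectively filled in round-robin order, which is what keeps the per-bin counts within one item of each other. Each subset could then be analyzed independently, for example on a separate workstation.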
