Parallelization of Algorithms for Mining Data from Distributed Sources

We propose an approach to optimizing data mining in modern applications that work on distributed data. We formally transform a high-level functional representation of a data-mining algorithm into a parallel implementation that performs as many computations as possible locally at the data sources, rather than accumulating all data for processing at a central location as in the traditional MapReduce approach. Our approach avoids the main disadvantages of state-of-the-art MapReduce frameworks in the context of distributed data: increased run time, high network traffic, and the risk of unauthorized access to data. We use a popular data-mining algorithm, Naive Bayes, to illustrate our approach and to evaluate it experimentally. Our experiments confirm that the implementation of Naive Bayes developed using our approach significantly outperforms the traditional MapReduce-based implementation in both run time and network traffic.
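The idea of computing locally at the data sources can be illustrated with a minimal Python sketch for Naive Bayes with categorical features. It is not the formally derived implementation described in this paper; it only assumes that each source can reduce its records to count tables and that a coordinator sums those tables. The function names (`local_counts`, `merge`, `predict`), the Laplace smoothing, and the toy data are hypothetical choices made for illustration.

```python
from collections import Counter
import math

def local_counts(records):
    """Run at one data source: count labels and (label, feature index, value) triples."""
    class_counts, feature_counts = Counter(), Counter()
    for features, label in records:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(label, i, value)] += 1
    return class_counts, feature_counts

def merge(partials):
    """Run at the coordinator: sum the count tables received from all sources."""
    total_class, total_feature = Counter(), Counter()
    for class_counts, feature_counts in partials:
        total_class.update(class_counts)
        total_feature.update(feature_counts)
    return total_class, total_feature

def predict(model, features, values_per_feature):
    """Return the most probable label (Laplace smoothing is an assumption of this sketch)."""
    class_counts, feature_counts = model
    n = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n)
        for i, value in enumerate(features):
            smoothed = (feature_counts[(label, i, value)] + 1) / (count + values_per_feature[i])
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data held at two separate sources.
source_a = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")]
source_b = [(("sunny", "mild"), "yes"), (("rainy", "hot"), "no"), (("rainy", "mild"), "yes")]

# Only the count tables leave each source, never the raw records.
distributed_model = merge([local_counts(source_a), local_counts(source_b)])
# Reference model built by pooling all raw records in one place.
central_model = local_counts(source_a + source_b)
```

In this sketch the merged model equals the one built from the pooled records, because the counts are plain sums. The data sent over the network is proportional to the size of the count tables rather than to the number of records.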