Paper Information - Parallelization of Algorithms for Mining Data from Distributed Sources

Parallelization of Algorithms for Mining Data from Distributed Sources

We propose an approach to optimizing data mining in modern applications that work on distributed data. We formally transform a high-level functional representation of a data-mining algorithm into a parallel implementation that performs as many computations as possible locally at the data sources, rather than gathering all data for processing at a central location as in the traditional MapReduce approach. Our approach avoids the main disadvantages of state-of-the-art MapReduce frameworks in the context of distributed data: increased run time, high network traffic, and unauthorized access to data. We use a popular data-mining algorithm, Naive Bayes, to illustrate our approach and to evaluate it experimentally. Our experiments confirm that the implementation of Naive Bayes developed with our approach significantly outperforms the traditional MapReduce-based implementation in both run time and network traffic.
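The idea of computing locally at the data sources can be illustrated with a minimal Python sketch. It is not the authors' formal transformation or their implementation; it only assumes that Naive Bayes training reduces to counting, so each source can build its own count tables and ship only those tables for merging. The function names (`local_counts`, `merge`, `predict`), the toy records, and the `num_values` smoothing parameter are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def local_counts(rows):
    """Run at one data source: count classes and (feature index, value, class) triples."""
    class_counts = Counter()
    feature_counts = Counter()
    for features, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1
    return class_counts, feature_counts


def merge(partials):
    """Sum the per-source count tables; only these tables leave the sources."""
    total_classes, total_features = Counter(), Counter()
    for class_counts, feature_counts in partials:
        total_classes.update(class_counts)
        total_features.update(feature_counts)
    return total_classes, total_features


def predict(model, features, num_values=2):
    """Classify with add-one smoothing; num_values is an assumed number of
    distinct values per feature (hypothetical parameter for this sketch)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(features):
            hits = feature_counts[(i, value, label)]
            score += math.log((hits + 1) / (count + num_values))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy records held at two separate sources (invented for illustration).
source_a = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")]
source_b = [
    (("sunny", "mild"), "yes"),
    (("rainy", "hot"), "no"),
    (("rainy", "mild"), "yes"),
]

# Distributed variant: count at each source, then merge the small tables.
model = merge([local_counts(source_a), local_counts(source_b)])

# Centralized variant: move all raw records to one place, then count.
central = local_counts(source_a + source_b)
```

In this sketch the merged model is identical to the one built from the centralized data, while the raw records never leave their sources. Only the count tables are transferred, and their size depends on the number of distinct feature values and classes rather than on the number of records.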

Sergei Gorlatch | Andrey Shorov | Ivan Kholod | Maria Efimova
