Paper Information - Research and Implementation of Parallel Data Mining of Process Object Based on Spark

This paper studies parallel data mining on Spark and applies it to the analysis of process object data. Starting from the algorithm flow of stand-alone process object data mining, we propose several Spark-based parallel algorithm flows. Through implementation, parallel efficiency testing, and algorithm tuning, we arrive at an optimized parallel algorithm flow. These solutions improve computational efficiency.
