Paper Information - The Research of Large Scale Data Processing Platform Based on the Spark

With the development of cloud computing and distributed cluster technologies, the concept of big data has expanded widely and deeply in both volume and value, and data mining, which plays an important role in exploring big data, has attracted unprecedented attention in recent years. Traditional data mining algorithms are incapable of dealing with massive datasets. MapReduce has been successfully applied to many big data problems; however, it lacks the ability to efficiently support parallelized, iterative learning. To address these problems, we present an integrated solution based on the Spark framework that not only processes massive data efficiently but also scales well, and can satisfy the demands of many kinds of data mining tasks. We further propose a framework applied to the traffic field.

Cao Xin | Chu Na

[1] R. V. van Nieuwpoort, et al. The Grid 2: Blueprint for a New Computing Infrastructure, 2003.

[2] Michael J. Franklin, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, 2012, NSDI.

[3] Ming-Yen Lin, et al. Apriori-based frequent itemset mining algorithms on MapReduce, 2012, ICUIMC.

[4] Hui Qiang Wang, et al. A Cloud Security Situational Awareness Model Based on Parallel Apriori Algorithm, 2014.

[5] Bi-Ru Dai, et al. Efficient Map/Reduce-Based DBSCAN Algorithm with Optimized Data Partition, 2012, 2012 IEEE Fifth International Conference on Cloud Computing.

[6] Bo He, et al. A Parallel Algorithm for Mining Association Rules Based on FP-tree, 2011, CSEE.

[7] Ralf Lämmel, et al. Google's MapReduce programming model - Revisited, 2007, Sci. Comput. Program.

[8] Nathan Marz, et al. Big Data: Principles and best practices of scalable realtime data systems, 2015.