PDM: A Parallel Data Analysis System Based on Hadoop

PDM (Parallel Data Mining), a parallel data analysis system, was built on Hadoop. PDM contains a large number of parallel data analysis algorithms based on the MapReduce computational framework. These include not only classic algorithms for ETL, data mining, statistics, and text analysis, but also SNA (social network analysis) algorithms based on graph mining. The principles and implementation of the parallel multiple linear regression algorithm and the multi-source shortest path algorithm are described, and the proposed "message-passing model" effectively addresses MapReduce's difficulty in handling the adjacency matrix structure. This paper also illustrates some typical telecommunications applications, such as "business recommendation" based on the parallel k-means and decision tree algorithms and "marketing key points discovery" based on the parallel PageRank algorithm. Finally, performance test results show that the proposed system is suitable for processing large-scale data efficiently.
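The abstract does not spell out the message-passing model, so the following is only a minimal single-process sketch of one common way such a scheme can work for multi-source shortest paths, not the PDM implementation. It assumes the graph is stored as per-node adjacency lists rather than an adjacency matrix, that edge weights are non-negative, and that each map step sends tentative distances along outgoing edges while each reduce step keeps the minimum per source. All names (`map_phase`, `reduce_phase`, `multi_source_shortest_paths`) are hypothetical, and the shuffle stage is simulated with a dictionary instead of Hadoop.

```python
from collections import defaultdict
import math


def map_phase(node_id, record):
    """Re-emit the node's own record and send candidate distances to neighbors."""
    dist, edges = record
    # Pass the node structure through so the reducer can rebuild it.
    yield node_id, ("node", dist, edges)
    for neighbor, weight in edges.items():
        # One message carries tentative distances from every known source.
        message = {source: d + weight for source, d in dist.items()}
        if message:
            yield neighbor, ("msg", message)


def reduce_phase(node_id, values):
    """Merge incoming messages into the node, keeping the minimum per source."""
    dist, edges, messages = {}, {}, []
    for value in values:
        if value[0] == "node":
            dist, edges = dict(value[1]), value[2]
        else:
            messages.append(value[1])
    changed = False
    for message in messages:
        for source, d in message.items():
            if d < dist.get(source, math.inf):
                dist[source] = d
                changed = True
    return (dist, edges), changed


def multi_source_shortest_paths(graph, sources):
    """Iterate map/reduce rounds until no distance improves (assumed stop rule)."""
    state = {
        node: ({node: 0.0} if node in sources else {}, edges)
        for node, edges in graph.items()
    }
    while True:
        # Simulated shuffle: group map outputs by destination node.
        grouped = defaultdict(list)
        for node_id, record in state.items():
            for key, value in map_phase(node_id, record):
                grouped[key].append(value)
        any_changed = False
        new_state = {}
        for node_id, values in grouped.items():
            new_state[node_id], changed = reduce_phase(node_id, values)
            any_changed = any_changed or changed
        state = new_state
        if not any_changed:
            return {node: record[0] for node, record in state.items()}
```

Calling `multi_source_shortest_paths(graph, sources)` with `graph` given as a dictionary of `{node: {neighbor: weight}}` returns, for each node, a dictionary mapping every source that can reach it to the shortest distance found. Because each record holds only one node's outgoing edges, no step needs the whole adjacency matrix in memory, which is the property the abstract attributes to the message-passing model.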