Parallel mining frequent patterns over big transactional data in extended MapReduce

In the big data era, data volumes have grown from the terabyte scale to the petabyte scale, and traditional algorithms cannot meet the needs of big data computing. This paper designs a parallel algorithm for mining frequent patterns over big transactional data, based on an extended MapReduce framework. The large data file is first split into many subfiles; the patterns in each subfile are then located quickly by bitmap computation, scanning the data only once. The results computed for all subfiles are then merged to mine the frequent patterns of the whole data set. To improve performance, insignificant patterns are pruned by a statistical analysis method while the subfiles are processed. Experimental results show that the method is efficient, scales well, and can be used to mine frequent patterns in big data.
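The split, bitmap, and merge steps described above can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the function names (`build_bitmaps`, `map_chunk`, `reduce_counts`, `mine`), the `max_size` cap on itemset length, and the use of Python integers as bitmaps are all assumptions made for illustration. The statistical pruning step is left out because the abstract does not specify the method.

```python
from collections import Counter
from itertools import combinations


def build_bitmaps(chunk):
    """Scan a chunk once: bit i of bitmaps[item] is set if transaction i has item."""
    bitmaps = {}
    for i, transaction in enumerate(chunk):
        for item in set(transaction):
            bitmaps[item] = bitmaps.get(item, 0) | (1 << i)
    return bitmaps


def map_chunk(chunk, max_size=2):
    """Map step (hypothetical): local support of every itemset up to max_size."""
    bitmaps = build_bitmaps(chunk)
    items = sorted(bitmaps)  # assumes items are mutually comparable
    counts = Counter()
    for k in range(1, max_size + 1):
        for itemset in combinations(items, k):
            bits = bitmaps[itemset[0]]
            for item in itemset[1:]:
                bits &= bitmaps[item]  # AND keeps transactions holding all items
            support = bin(bits).count("1")  # popcount = local support
            if support:
                counts[itemset] = support
    return counts


def reduce_counts(partials):
    """Reduce step (hypothetical): sum the local supports across chunks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def mine(transactions, n_chunks, min_support, max_size=2):
    """Split into chunks, count locally, merge, then apply the global threshold."""
    size = max(1, -(-len(transactions) // n_chunks))  # ceiling division
    chunks = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    total = reduce_counts(map_chunk(c, max_size) for c in chunks)
    threshold = min_support * len(transactions)
    return {s: n for s, n in total.items() if n >= threshold}
```

In a real MapReduce deployment each `map_chunk` call would run on a separate worker; here the calls run sequentially only to keep the sketch self-contained. Because every non-zero local count is emitted, the merged supports are exact, at the cost of the intermediate data that the paper's pruning step is meant to reduce.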
