A Parallel MapReduce Algorithm to Efficiently Support Itemset Mining on High Dimensional Data

Abstract

In today's world, large volumes of data are continuously generated by many scientific applications, such as bioinformatics and networking. Since each monitored event is usually characterized by a variety of features, the resulting datasets are high-dimensional. To extract value from these complex data collections, exploratory data mining algorithms can be used to discover hidden and non-trivial correlations among data. Frequent closed itemset mining is an effective but computationally expensive technique that is commonly used to support data exploration. Thanks to the spread of distributed and parallel frameworks, scalable approaches able to deal with so-called Big Data have been developed for frequent itemset mining as well. Unfortunately, most current algorithms are designed to cope with low-dimensional datasets and perform poorly on high-dimensional data. This work introduces PaMPa-HD, a MapReduce-based frequent closed itemset mining algorithm for high-dimensional datasets. It proposes an efficient solution to parallelize and speed up the mining process, together with different strategies to easily configure the algorithm parameter. Experimental results on real-life high-dimensional use cases show the efficiency of the proposed approach in terms of execution time, load balancing, and robustness to memory issues.
