Frequent Itemset Mining for Big Data

Traditional data mining tools, developed to extract actionable knowledge from data, have proven inadequate for processing the huge amount of data produced nowadays. Even the most popular algorithms for Frequent Itemset Mining, an exploratory data analysis technique used to discover frequent co-occurrences of items in a transactional dataset, are inefficient on larger and more complex data. As a consequence, many parallel algorithms have been developed on top of modern frameworks that leverage distributed computation on clusters of commodity machines (e.g., Apache Hadoop, Apache Spark). However, parallelizing frequent itemset mining is far from trivial: the search-space exploration on which all the techniques are based is not easily partitionable. Hence, distributed frequent itemset mining is a challenging problem and an interesting research topic. In this context, our main contributions are (i) an exhaustive theoretical and experimental analysis of the best-in-class approaches, whose outcomes and open issues motivated (ii) the development of a distributed high-dimensional frequent itemset miner. The dissertation also introduces (iii) a data mining framework that relies heavily on distributed frequent itemset mining to extract a specific type of itemset. The theoretical analysis highlights the challenges of distributing and preliminarily partitioning the frequent itemset mining problem (i.e., the search-space exploration) and describes the most widely adopted distribution strategies. The extensive experimental campaign, in turn, compares the expectations raised by the algorithmic choices against the actual performance of the algorithms. We ran more than 300 experiments to evaluate and discuss the performance of the algorithms with respect to different real-life use cases and data distributions.
The outcome of the review is that no algorithm is universally superior and that performance is heavily affected by the data distribution. Moreover, we identified a concrete gap in frequent pattern extraction for high-dimensional use cases. For this reason, we developed our own distributed high-dimensional frequent itemset miner based on Apache Hadoop. The algorithm splits the search-space exploration into independent sub-tasks. However, since the exploration benefits strongly from full knowledge of the problem, we introduced an interleaving synchronization phase. The result is a trade-off between the benefits of a centralized state and the additional computational power provided by parallelism. The experimental benchmarks, performed on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing, and robustness to memory issues. Finally, the dissertation introduces a data mining framework in which distributed itemset mining is a fundamental component of the processing pipeline. The aim of the framework is the extraction of a new type of itemset, called misleading generalized itemsets.
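
To make the underlying problem concrete, the following is a minimal, centralized sketch of frequent itemset mining in the level-wise (Apriori) style. It is not the distributed miner described above; the function name `frequent_itemsets`, the toy transactions, and the absolute `min_support` threshold are assumptions chosen purely for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset contained in at
    least `min_support` transactions (level-wise, Apriori-style)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        previous = list(current)
        candidates = {a | b for a, b in combinations(previous, 2) if len(a | b) == k}

        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }

        # Support counting: one scan of the transactions per level.
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support

        result.update(current)
        k += 1

    return result


# Hypothetical toy dataset: five transactions over the items a, b, c.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = frequent_itemsets(transactions, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Note that the pruning step consults the complete set of frequent itemsets from the previous level. This dependence on global state is one illustration of why the search-space exploration is hard to split into fully independent sub-tasks.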
