A highly scalable parallel algorithm for maximally informative k-itemset mining

The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only a few solutions scale. This is particularly the case when (1) the data set is massive, calling for large-scale distribution, and/or (2) the length k of the informative itemset to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative $\underline{K}$-ItemSet), a highly scalable, parallel miki mining algorithm. PHIKS makes the mining of large-scale databases (up to terabytes of data) succinct and effective. Its mining process consists of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of miki of different sizes, which drastically reduce the execution time, the communication cost, and the energy consumption on a distributed computational platform. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal through the significant scale-up obtained with long itemsets and very large databases.
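
To make the objective concrete, the following is a minimal, single-machine sketch of what "maximally informative k-itemset based on joint entropy" means. It is not PHIKS and contains none of its parallel jobs or optimizations; it is an exhaustive baseline written only for illustration. The function names `joint_entropy` and `brute_force_miki`, and the representation of the database as a list of Python sets, are assumptions introduced here. The joint entropy of an itemset is taken over the presence/absence patterns that its items form across the transactions.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the presence/absence pattern of `itemset`.

    Each transaction is projected onto the itemset as a tuple of booleans;
    the entropy is computed over the empirical distribution of those tuples.
    """
    items = sorted(itemset)
    n = len(transactions)
    # Count how often each presence/absence pattern occurs.
    patterns = Counter(tuple(item in t for item in items) for t in transactions)
    return -sum((c / n) * log2(c / n) for c in patterns.values())


def brute_force_miki(transactions, k):
    """Return the k-itemset with the highest joint entropy (illustrative only).

    Enumerates every k-combination of items, which is exponential in k and
    is exactly the cost that scalable approaches try to avoid.
    """
    universe = sorted(set().union(*transactions))
    return max(combinations(universe, k),
               key=lambda s: joint_entropy(transactions, s))


if __name__ == "__main__":
    # Hypothetical toy database: item "c" occurs everywhere, so it carries
    # no information; "a" and "b" together split the data into four groups.
    db = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}]
    print(brute_force_miki(db, 2))
```

In this toy database, an item present in every transaction contributes zero entropy, while two items that vary independently yield the maximum of 2 bits for k = 2. The exhaustive enumeration is what becomes infeasible for large k and large databases.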
