SILVERBACK+: scalable association mining via fast list intersection for columnar social data

We present Silverback+, a scalable probabilistic framework for accurate association rule and frequent item-set mining of large-scale social behavioral data. Silverback+ addresses efficient storage utilization and management through (1) a probabilistic columnar infrastructure and (2) Bloom filters and sampling techniques. In addition, probabilistic pruning techniques based on the Apriori method are developed to accelerate the mining of frequent item-sets. The proposed target-driven techniques significantly reduce both the number of frequent item-set candidates and the number of repeated membership checks, the latter through a novel list intersection algorithm. Extensive experimental evaluations demonstrate the benefits of this context-aware approach, which takes infrastructure limitations into account when applying these techniques. Compared to the traditional Hadoop-based approach, which improves scalability by simply adding more hosts, Silverback+ exhibits much better runtime performance with negligible loss of accuracy.
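To make the interplay between Apriori-style pruning and list intersection concrete, the following is a minimal, exact (non-probabilistic) sketch in Python. It is not the Silverback+ algorithm: it assumes a columnar layout in which each item maps to an ascending list of transaction IDs, and it uses a textbook two-pointer intersection rather than the novel algorithm mentioned above. The names `intersect_sorted`, `frequent_itemsets`, `columns`, and `min_support` are hypothetical and chosen for illustration only; Bloom filters and sampling are omitted.

```python
from itertools import combinations


def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending lists of transaction IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def frequent_itemsets(columns, min_support):
    """Return {itemset: support} for all item-sets with support >= min_support.

    columns: dict mapping each item to the ascending list of transaction IDs
    that contain it (an assumed columnar layout, one "column" per item).
    """
    # Level 1: keep only items that are frequent on their own.
    level = {
        frozenset([item]): tids
        for item, tids in columns.items()
        if len(tids) >= min_support
    }
    result = {s: len(t) for s, t in level.items()}
    k = 2
    while level:
        next_level = {}
        seen = set()  # candidates already examined at this level
        for x, y in combinations(sorted(level, key=sorted), 2):
            cand = x | y
            if len(cand) != k or cand in seen:
                continue
            seen.add(cand)
            # Apriori pruning: every (k-1)-subset must itself be frequent.
            if any(frozenset(s) not in level for s in combinations(cand, k - 1)):
                continue
            # Support counting: intersect the ID lists of the two parents.
            tids = intersect_sorted(level[x], level[y])
            if len(tids) >= min_support:
                next_level[cand] = tids
        result.update({s: len(t) for s, t in next_level.items()})
        level = next_level
        k += 1
    return result


# Toy data: four behaviors observed over five transactions (made up).
columns = {
    "like": [1, 2, 3, 5],
    "share": [1, 2, 5],
    "comment": [2, 3, 5],
    "follow": [4],
}
result = frequent_itemsets(columns, min_support=2)
```

In this sketch the pruning step discards a candidate before any list is touched, so the comparatively expensive intersection runs only for candidates whose subsets are all frequent. A probabilistic variant would replace the exact ID lists with compact summaries, at the cost of some error in the support counts.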
