SILVERBACK+: scalable association mining via fast list intersection for columnar social data

We present Silverback+, a scalable probabilistic framework for accurate association rule and frequent item-set mining of large-scale social behavioral data. Silverback+ addresses efficient storage utilization and management through (1) a probabilistic columnar infrastructure and (2) Bloom filters and sampling techniques. In addition, probabilistic pruning techniques based on the Apriori method are developed to accelerate the mining of frequent item-sets. The proposed target-driven techniques significantly reduce both the number of frequent item-set candidates and, through a novel list intersection algorithm, the number of repetitive membership checks required. Extensive experimental evaluations demonstrate the benefits of this context-aware approach, which takes the limitations of the infrastructure into account when applying the corresponding techniques. Compared with the traditional Hadoop-based approach of improving scalability by simply adding more hosts, Silverback+ achieves much better runtime performance with negligible loss of accuracy.
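To make the role of list intersection in Apriori-style mining concrete, the following is a minimal sketch, not the Silverback+ algorithm itself. It assumes a columnar layout in which each item maps to an ascending list of transaction ids, and it computes support exactly with a plain two-pointer intersection. The Bloom filters, sampling, and probabilistic pruning described in the abstract are not modeled. The names `intersect_sorted`, `frequent_itemsets`, `columns`, and `min_support` are hypothetical.

```python
from itertools import combinations


def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending lists of transaction ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def frequent_itemsets(columns, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    columns maps each item to the ascending list of transaction ids that
    contain it (an assumed columnar layout, for illustration only).
    """
    # Level 1: keep only items whose column is long enough.
    current = {
        frozenset([item]): tids
        for item, tids in columns.items()
        if len(tids) >= min_support
    }
    result = {}
    while current:
        result.update({itemset: len(tids) for itemset, tids in current.items()})
        next_level = {}
        for x, y in combinations(list(current), 2):
            candidate = x | y
            # Only join itemsets that differ in exactly one item.
            if len(candidate) != len(x) + 1 or candidate in next_level:
                continue
            # Apriori pruning: every subset one item smaller must be frequent.
            if any(candidate - {item} not in current for item in candidate):
                continue
            # Support counting is a single list intersection.
            tids = intersect_sorted(current[x], current[y])
            if len(tids) >= min_support:
                next_level[candidate] = tids
        current = next_level
    return result


# Toy data: four items over five transactions (made up for illustration).
columns = {"a": [1, 2, 3, 5], "b": [1, 2, 4, 5], "c": [2, 3, 5], "d": [4]}
supports = frequent_itemsets(columns, min_support=2)
```

In this sketch, the Apriori check discards a candidate before any transaction ids are touched, so the intersection cost is paid only for candidates whose subsets are all frequent.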
