Differentially private maximal frequent sequence mining

In this paper, we study the problem of designing a differentially private algorithm for mining maximal frequent sequences that achieves high data utility, a high degree of privacy, and high time efficiency. To solve this problem, we present a new differentially private algorithm, referred to as DP-MFSM. DP-MFSM consists of three phases: a pre-processing phase, an expected frequent sequence mining (ESM) phase, and a candidate extraction and verification (CEV) phase. In the pre-processing phase, we extract statistical information from the input database and use it to set the values of variables used in the ESM phase. In the ESM phase, we randomly partition the input database into several sub-databases and use a partition-based ESM technique to find expected frequent sequences, a subset of the candidate frequent sequences that are more likely to be frequent. Finally, in the CEV phase, we extract candidate maximal frequent sequences from the discovered expected frequent sequences and use a splitting-based technique to verify which candidates are actually frequent in the input database. Through privacy analysis, we show that DP-MFSM is ε-differentially private. Extensive experiments on real-world datasets illustrate that DP-MFSM can substantially outperform alternative approaches.
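As a rough illustration of two building blocks mentioned above, the sketch below randomly partitions a sequence database into disjoint sub-databases and answers a single support query with Laplace noise. It is not the DP-MFSM algorithm: the function names (`random_partition`, `noisy_support`), the round-robin split, and the use of the Laplace mechanism with sensitivity 1 are assumptions made here for illustration, and the abstract does not specify how the ESM or CEV phases add noise or allocate the privacy budget.

```python
import random


def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def random_partition(database, k, seed=None):
    """Shuffle the records and deal them round-robin into k disjoint sub-databases.

    Hypothetical helper: the abstract only says the database is randomly
    partitioned, not how.
    """
    rng = random.Random(seed)
    shuffled = list(database)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_support(pattern, database, epsilon, rng):
    """Support count of `pattern` plus Laplace noise of scale 1/epsilon.

    Adding or removing one input sequence changes the count by at most 1,
    so this single query satisfies epsilon-differential privacy. Budget
    splitting across many queries is deliberately left out of this sketch.
    """
    true_count = sum(is_subsequence(pattern, seq) for seq in database)
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    db = [list("abc"), list("acb"), list("bca"), list("abcd"), list("a")]
    parts = random_partition(db, k=2, seed=0)
    rng = random.Random(1)
    for i, part in enumerate(parts):
        print(i, len(part), noisy_support(list("ab"), part, epsilon=1.0, rng=rng))
```

Because the sub-databases are disjoint, each input sequence affects the counts in only one of them, which is one reason partitioning is a common device in differentially private mining.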
