An Upper-Bound Control Approach for Cost-Effective Privacy Protection of Intermediate Dataset Storage in Cloud

As more and more data-intensive applications migrate into cloud environments, valuable intermediate datasets are often stored in order to avoid the high cost of recomputing them. However, this poses a risk to data privacy, because malicious parties may deduce private information about the parent or original dataset by analyzing some of the stored intermediate datasets. The traditional way of addressing this issue is to encrypt all of the stored datasets so that they are hidden. We argue that this is neither efficient nor cost-effective: not every dataset needs to be encrypted, and encrypting a large volume of datasets can be very costly. In this paper, we propose a new approach to identify which stored datasets need to be encrypted and which do not. Drawing on information-theoretic analysis, our approach derives an upper bound on a privacy measure. As long as the overall mixed information amount of a set of stored datasets does not exceed that upper bound, those datasets do not need to be encrypted, and privacy is still protected. A tree model is used to analyze the privacy disclosure of datasets, and privacy requirements are decomposed and satisfied layer by layer. With a heuristic implementation of this approach, evaluation results demonstrate that the cost of encrypting intermediate datasets decreases significantly compared with the traditional approach, while the privacy protection of the parent or original dataset is guaranteed.
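
The selection idea described above can be illustrated with a minimal sketch. It is not the paper's heuristic: it assumes each dataset carries a known per-dataset leakage value, that summing these values gives a conservative estimate of the combined leakage, and that a simple greedy rule is acceptable. The tree model and the layer-by-layer decomposition are omitted. All names (`Dataset`, `choose_plaintext`, `epsilon`) and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    encryption_cost: float  # hypothetical cost of keeping the dataset encrypted
    leakage: float          # hypothetical privacy leakage if stored unencrypted


def choose_plaintext(datasets, epsilon):
    """Greedily pick datasets to leave unencrypted under a leakage bound.

    Assumption: the sum of per-dataset leakages is used as a conservative
    stand-in for the combined leakage, and it must not exceed `epsilon`.
    """
    def saving_per_leakage(d):
        # Datasets that leak nothing are always worth leaving unencrypted.
        return float("inf") if d.leakage == 0 else d.encryption_cost / d.leakage

    plaintext, used = [], 0.0
    # Consider the datasets with the best cost saving per unit leakage first.
    for d in sorted(datasets, key=saving_per_leakage, reverse=True):
        if used + d.leakage <= epsilon:
            plaintext.append(d)
            used += d.leakage
    encrypted = [d for d in datasets if d not in plaintext]
    return plaintext, encrypted


if __name__ == "__main__":
    candidates = [
        Dataset("A", encryption_cost=10.0, leakage=0.5),
        Dataset("B", encryption_cost=4.0, leakage=0.1),
        Dataset("C", encryption_cost=6.0, leakage=0.6),
    ]
    keep_plain, must_encrypt = choose_plaintext(candidates, epsilon=0.7)
    print("unencrypted:", [d.name for d in keep_plain])
    print("encrypted:", [d.name for d in must_encrypt])
```

Only the datasets that fall outside the leakage budget incur an encryption cost, which is where the saving over encrypting everything would come from in this toy setting.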
