Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization

Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working set of in-use data in accessible storage, while unused data is evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads, which improves performance or scalability. We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea and a proof of concept for leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization. We evaluate it on three applications and show that Pyramid approaches state-of-the-art models while training on less than 1% of the raw data.
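
As a rough illustration of the general idea behind count featurization (a minimal sketch, not Pyramid's implementation), each categorical feature value can be replaced by a statistic derived from historical label counts, so that a model can train on a small recent window of raw data plus the aggregate counts. The class name `CountTable`, the binary labels, the add-one smoothing, and the toy `history` data below are all assumptions made for illustration.

```python
from collections import defaultdict


class CountTable:
    """Hypothetical per-feature table of label counts, keyed by feature value."""

    def __init__(self):
        # value -> [number of label-0 observations, number of label-1 observations]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, value, label):
        """Record one historical observation with a binary label (0 or 1)."""
        self.counts[value][label] += 1

    def featurize(self, value):
        """Return a smoothed estimate of P(label = 1 | value).

        Add-one smoothing is an assumption here; unseen values map to 0.5.
        """
        neg, pos = self.counts.get(value, (0, 0))
        return (pos + 1) / (neg + pos + 2)


# Toy history of (feature value, label) pairs; purely illustrative data.
history = [("ad_a", 1), ("ad_a", 0), ("ad_a", 1), ("ad_b", 0)]

table = CountTable()
for value, label in history:
    table.update(value, label)

# A recent raw observation is mapped to a count-based feature before training.
recent_value = "ad_a"
count_feature = table.featurize(recent_value)
```

In this sketch only the aggregate counts and the recent observations need to stay in accessible storage; the older raw records that produced the counts are no longer needed for training.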
