Making Big Data, Privacy, and Anonymization Work Together in the Enterprise: Experiences and Issues

Some scholars argue that Big Data techniques render anonymization (also known as de-identification) useless as a privacy protection technique. This paper describes our experiences, and the issues we encountered, in successfully combining anonymization, privacy protection, and Big Data techniques to analyze usage data while protecting users' identities. Our Human Factors Engineering team wanted to use web page access logs and Big Data tools to improve the usability of Intel's heavily used internal web portal. To protect Intel employees' privacy, the team needed to remove personally identifying information (PII) from the portal's usage-log repository, but in a way that neither hindered Big Data analysis nor prevented re-identifying a log entry in order to investigate unusual behavior. To meet these objectives, we created an open architecture for anonymization that allowed a variety of tools to be used for both de-identifying and re-identifying web log records. While implementing the architecture, we found that enterprise data has properties different from the standard examples in the anonymization literature. Our proof of concept showed that Big Data techniques can yield benefits in the enterprise environment even when operating on anonymized data. We also found that, despite masking obvious PII such as usernames and IP addresses, the anonymized data remained vulnerable to correlation attacks, and we explored the tradeoffs of correcting these vulnerabilities. In particular, User Agent (browser/OS) information correlates strongly with individual users; although browser fingerprinting was already known, this finding has implications for the tools and products currently used to de-identify enterprise data. We conclude that Big Data, anonymization, and privacy can be successfully combined, but doing so requires analyzing each data set to ensure that its anonymization is not vulnerable to correlation attacks.
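The combination of de-identification and authorized re-identification described above can be sketched with keyed pseudonymization: a deterministic keyed hash replaces each PII field, so Big Data tools can still group and join on the tokenized field, while a restricted mapping allows an investigator to reverse a token. This is a minimal illustration, not the paper's actual implementation; the key handling, field names, and the in-memory re-identification map are assumptions for the example.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice it would live in a restricted key store.
SECRET_KEY = b"example-key-kept-in-a-vault"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic keyed hash: the same input always yields the same
    token, so analysis tools can still count, group, and join on the
    field, but the raw value is hidden from analysts."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Restricted token -> original-value map, enabling authorized
# re-identification when unusual behavior needs investigation.
reid_map = {}

def anonymize_log_entry(entry: dict) -> dict:
    """Replace direct identifiers in one web-log record with tokens,
    recording the reverse mapping for authorized re-identification."""
    out = dict(entry)
    for field in ("username", "client_ip"):  # example PII fields
        token = pseudonymize(entry[field])
        reid_map[token] = entry[field]
        out[field] = token
    return out

entry = {"username": "jdoe", "client_ip": "10.1.2.3", "path": "/portal/home"}
anon = anonymize_log_entry(entry)
```

Note that, as the abstract warns, masking direct identifiers this way is not sufficient on its own: quasi-identifiers left in the record (such as the User Agent string) can still single out individuals via correlation.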
