Social scientists, such as those performing research at the Kinsey Institute for Research in Sex, Gender and Reproduction, may use surveys to gather large amounts of sensitive data. Unlike purely medical datasets, these social science datasets tend to be sparse and high-dimensional, which makes it possible to characterize participants in unique ways. These unique characterizations may enable individuals to be linked to external data in ways that have not previously been considered. Therefore, traditional approaches to de-identifying data, such as fulfilling HIPAA requirements, may not be sufficient to prevent the re-identification of participants in large social science datasets.
In this paper, we evaluate the statistical characteristics of two high-dimensional social science datasets to better understand how unique features impact privacy. We apply a class of statistical de-anonymization attacks in an attempt to achieve theoretical re-identification of participants. We assume that an attacker has exact knowledge of a subset of attribute values for a particular record and wants to link this subset to the actual record in order to discover its remaining content. We show that although 98% of the records within the dataset are unique given any three attributes, re-identification of the records may not be easily achieved. We attribute this limited re-identification to the inherent similarity in the human behaviors that these surveys measure. This work is the first to characterize re-identification risks in high-dimensional data collected in surveys designed to capture the behaviors and experiences of groups of individuals.
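The uniqueness claim above can be made concrete. One plausible reading of "98% of the records are unique given any three attributes" is: for every choice of three attributes, at least 98% of records have a value combination on those attributes that no other record shares. The sketch below, a hypothetical illustration not taken from the paper, computes the worst-case (minimum) uniqueness fraction over all three-attribute subsets of a toy record set; the attribute names and data are invented for the example.

```python
from itertools import combinations
from collections import Counter

def min_unique_fraction(records, attrs, k=3):
    """Minimum, over all k-attribute subsets, of the fraction of
    records whose projection onto that subset is unique."""
    worst = 1.0
    for subset in combinations(attrs, k):
        # Project every record onto this attribute subset.
        proj = [tuple(r[a] for a in subset) for r in records]
        counts = Counter(proj)
        frac = sum(1 for p in proj if counts[p] == 1) / len(records)
        worst = min(worst, frac)
    return worst

# Toy survey records with hypothetical attributes.
records = [
    {"age": 23, "region": "N", "score": 5, "freq": 2},
    {"age": 23, "region": "S", "score": 5, "freq": 2},
    {"age": 31, "region": "N", "score": 1, "freq": 4},
    {"age": 45, "region": "E", "score": 3, "freq": 1},
]
print(min_unique_fraction(records, ["age", "region", "score", "freq"]))
# → 0.5 (records 0 and 1 collide on the subset age/score/freq)
```

An attacker with exact background knowledge of three attribute values for a target would search for records matching that projection; when the projection is unique, the match pins down the full record, which is the linkage threat the paper evaluates.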