Uniqueness and how it impacts privacy in health-related social science datasets

Social scientists, such as those performing research at the Kinsey Institute for Research in Sex, Gender and Reproduction, may use surveys to gather large amounts of sensitive data. Unlike purely medical datasets, these social science datasets tend to be sparse and high-dimensional, which creates opportunities to characterize participants in unique ways. These unique characterizations may enable individuals to be linked to external data in ways that have not been previously considered. Therefore, traditional approaches to de-identifying data, such as fulfilling HIPAA requirements, may not be sufficient to prevent the re-identification of participants in large social science datasets. In this paper, we evaluate the statistical characteristics of two high-dimensional social science datasets to better understand how unique features impact privacy. We apply a class of statistical de-anonymization attacks in an attempt to achieve theoretical re-identification of participants. We assume that an attacker has exact knowledge of a subset of attribute values for a particular record and wants to link this subset to the actual record in order to discover its remaining content. We show that although 98% of the records within the dataset are unique given any three attributes, re-identification of the records may not be easily achieved. We attribute this limited re-identification to the inherent similarity in the human behavior that the scientists measure. This work is the first to characterize re-identification risks in high-dimensional data collected through surveys designed to capture the various behaviors and experiences of groups of individuals.
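To make the uniqueness measure and the attacker model concrete, the following is a minimal Python sketch under stated assumptions. It uses exact matching on a toy table and is not the statistical attack evaluated in the paper. The function names (unique_fraction, mean_unique_fraction, candidates), the attribute names, and the four records are all hypothetical and invented for illustration. Averaging over every k-attribute subset is only one possible reading of "unique given any three attributes".

```python
from collections import Counter
from itertools import combinations


def unique_fraction(records, attrs):
    """Fraction of records whose values on `attrs` match no other record."""
    keys = [tuple(r[a] for a in attrs) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


def mean_unique_fraction(records, k):
    """Average of unique_fraction over every k-attribute subset (an assumed reading)."""
    names = sorted(records[0])
    subsets = list(combinations(names, k))
    return sum(unique_fraction(records, s) for s in subsets) / len(subsets)


def candidates(records, known):
    """Indices of records consistent with the attacker's exact knowledge."""
    return [
        i for i, r in enumerate(records)
        if all(r[a] == v for a, v in known.items())
    ]


# Toy survey table; attribute names and values are invented, not from the paper.
records = [
    {"age_group": "18-24", "partners": 1, "region": "midwest"},
    {"age_group": "18-24", "partners": 1, "region": "south"},
    {"age_group": "25-34", "partners": 2, "region": "midwest"},
    {"age_group": "25-34", "partners": 2, "region": "midwest"},
]

print(unique_fraction(records, ("age_group", "partners", "region")))
print(candidates(records, {"age_group": "18-24", "region": "south"}))
print(candidates(records, {"age_group": "25-34"}))
```

In this toy table, a candidate list of length one means the known attributes single out one record, so its remaining attributes are exposed. A longer list means the attacker cannot tell similar respondents apart from that knowledge alone.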