Models and algorithms for privacy-preserving data mining

As computing technologies continue to advance, the amount of digitally available personal information has grown rapidly, bringing the privacy concerns of individuals to the forefront. This thesis explores modeling and algorithmic problems that arise from the need to protect the privacy of individuals in a variety of settings. Specifically, we study three problems.

The first problem we consider is that of online query auditing. The focus here is on an interactive scenario in which users pose aggregate queries over a statistical database containing private data. The privacy task is to deny queries whenever the answers to the current and past queries could be stitched together by a malicious user to infer private information. We demonstrate an efficient scheme for auditing bags of max and min queries that protects against a certain kind of privacy breach. Additionally, we study, for the first time, the utility of auditing algorithms and provide initial results on the utility of an existing algorithm for auditing sum queries.

The second problem we study is that of anonymizing unstructured data. We consider datasets in which each individual is associated with a set of items that constitute private information about that individual; market-basket datasets and search engine query logs are illustrative examples. We formalize the notion of k-anonymity for set-valued data as a variant of the k-anonymity model for traditional relational datasets. We define an optimization problem that arises from this definition of anonymity and provide a constant-factor approximation algorithm for it. We experimentally evaluate our algorithms on the America Online query log dataset.

In the last problem, we examine privacy concerns in online social networks, where the private information to be protected is a user's profile information or the set of individuals in the network that the user interacts with.
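To make the set-valued notion of k-anonymity from the second problem concrete: each published item set must be indistinguishable from at least k-1 others. The following is only a minimal illustrative sketch, not the constant-factor approximation algorithm developed in the thesis. It uses a naive heuristic (sort to place similar sets near each other, group in blocks of k, publish each group's union), and all names in it are assumptions for this sketch.

```python
def greedy_k_anonymize(records, k):
    """Partition set-valued records into groups of size >= k and publish
    each record as its group's union, so every published set occurs at
    least k times.  Illustrative baseline only; the union is a coarse
    generalization and may be far from optimal."""
    # Sort record indices so that similar item sets tend to be adjacent
    # (a simple heuristic, not a guarantee of low information loss).
    order = sorted(range(len(records)), key=lambda i: sorted(records[i]))
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a too-small trailing group into the previous one so that
    # every group has at least k members.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    anonymized = [None] * len(records)
    for group in groups:
        union = set().union(*(records[i] for i in group))
        for i in group:
            anonymized[i] = union
    return anonymized
```

The optimization problem in the thesis asks for a grouping that minimizes the information loss such a generalization incurs; this sketch only illustrates the anonymity constraint itself.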
We identify limits on the amount of lookahead that a social network should provide each user in order to protect its users from hijacking attacks. The lookahead of a network is essentially the amount of neighborhood visibility the network provides each user, and a hijacking attack is one in which an attacker strategically subverts (hijacks) user accounts in the network to gain access to different local neighborhoods. The attacker's goal is to piece these local neighborhoods together into a complete picture of the social network. By analyzing, both experimentally and theoretically, the feasibility of such attacks as a function of the network's lookahead, we make recommendations for the default lookahead settings of a privacy-conscious social network.
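The attack described above can be simulated on a toy graph. The sketch below assumes one particular convention for lookahead (lookahead 1 means a user sees only their own friend list; lookahead l reveals every edge incident to nodes within distance l-1) and simply measures what fraction of the network's edges a set of hijacked accounts jointly exposes; it is an illustration of the threat model, not the analysis carried out in the thesis.

```python
from collections import deque

def visible_edges(adj, source, lookahead):
    """Edges one account sees under the assumed convention: all edges
    incident to nodes within distance lookahead - 1 of the account."""
    dist = {source: 0}
    queue = deque([source])
    seen = set()
    while queue:
        u = queue.popleft()
        # Every dequeued node is within the visibility radius, so all
        # of its incident edges are revealed to this account.
        seen.update(frozenset((u, v)) for v in adj[u])
        if dist[u] < lookahead - 1:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return seen

def hijack_coverage(adj, hijacked, lookahead):
    """Fraction of the network's edges exposed by the union of the
    local views of the hijacked accounts."""
    total = {frozenset((u, v)) for u in adj for v in adj[u]}
    exposed = set()
    for account in hijacked:
        exposed |= visible_edges(adj, account, lookahead)
    return len(exposed) / len(total)
```

For example, on the path a-b-c-d, a single hijacked account at one end exposes more of the network as the lookahead grows, and two well-placed accounts can expose it entirely; the thesis studies how many hijacked accounts are needed as a function of lookahead at realistic scales.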