Protecting privacy in released database views

To cooperate on common tasks, organizations need to release data to each other from their private data sources. An important consideration when data are released to satisfy the needs of applications is personal privacy: only data with tolerable privacy disclosure can be released. To decide whether the disclosure is tolerable, privacy disclosure measures and checking algorithms are needed. Released data are often database views, that is, results of queries on private databases. This thesis focuses on identifying disclosure measures for database views and describing how to check database views for privacy disclosure based on these measures. More specifically, this thesis introduces two formally defined measures, k-uncertainty and k-indistinguishability. Given a positive integer k, k-uncertainty measures one aspect of privacy disclosure. Intuitively, released database views provide k-uncertainty when a particular attribute value of an entity cannot be determined to be among fewer than k possibilities by using the views together with the schema information of the private table. Released database views that violate k-uncertainty should be blocked. This thesis shows that, in general, deciding whether a set of views violates k-uncertainty is a computationally hard problem. Subcases are identified and their computational complexities are discussed. Especially interesting are those subcases that yield checking algorithms that run in time polynomial in the number of tuples in the views. This thesis also provides an efficient conservative algorithm that checks for sufficient conditions of k-uncertainty. Given a positive integer k, k-indistinguishability addresses the other aspect of privacy, namely, indistinguishability among k individuals in terms of their possible private values. Prior work has focused on using uncertainty as the metric, i.e., the uncertainty of the private property value of an individual. 
However, indistinguishability is necessary to hide an individual in a crowd, thus shielding the individual from any particular attention. This thesis initiates a study of indistinguishability by giving three different definitions of indistinguishability, one based on probability and the other two based on data symmetry. These definitions have different properties and lead to different k-indistinguishability metrics. This thesis then discusses computational complexities of checking whether a set of database views provides enough indistinguishability to protect privacy, and also provides practical checking algorithms. The information in database views may be released in the form of text documents. This thesis also studies a specific release-control task, namely, how to control the release of sensitive associations existing in private tables. This thesis introduces effective techniques to check outgoing documents for the appearance of such sensitive associations. Experiments show that for a reasonably sized database, sensitive associations can be stored efficiently and outgoing documents can be checked quickly. Finally, to regulate data release and apply the above measures and checking methods, the thesis presents a general hierarchical release control policy framework.
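
As a rough illustration of the k-uncertainty intuition stated above, the following sketch assumes that the set of private values still possible for each entity, given the released views and the schema, has already been computed. Computing those sets is the hard part and is not shown. The function name, the dictionary layout, and the sample data are hypothetical and are not taken from the thesis.

```python
from typing import Dict, Set


def provides_k_uncertainty(candidates: Dict[str, Set[str]], k: int) -> bool:
    """Return True when every entity still has at least k possible private values.

    `candidates` maps an entity to the set of private values that remain
    consistent with the released views (assumed to be precomputed).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return all(len(values) >= k for values in candidates.values())


# Hypothetical candidate sets for two entities.
candidates = {
    "alice": {"flu", "asthma", "diabetes"},
    "bob": {"flu", "asthma"},
}

print(provides_k_uncertainty(candidates, 2))  # True: both entities have >= 2 candidates
print(provides_k_uncertainty(candidates, 3))  # False: bob has only 2 candidates
```

Under this simplified reading, a release would be blocked as soon as any single entity's candidate set shrinks below k.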