Anonymization techniques for large and dynamic data sets

We make two main contributions to the problem of privacy in data publishing. First, we tackle the problem of anonymizing data sets that reside on secondary storage. Data anonymization is a process by which data is transformed to protect the identities of, and other sensitive information about, the individuals in the data. Preserving privacy for larger-than-memory data sets can be viewed as a scalability problem. Our solution exploits parallels between data anonymization and indexing in databases, and we show that database indexes offer benefits beyond scalability that previous data anonymization techniques lack.

The dynamics of a data set introduce an entirely new set of challenges not present for static data sets. Chief among them is the ability to anonymize and publish a changing data set without violating the privacy guarantees originally established by the data owner. Our second main contribution is a theoretical framework for controlling inference in a dynamic environment. We develop our solutions in the presence of an adversary who monitors the published anonymized data set as it is updated and whose objective is to circumvent its privacy guarantees by retrieving sensitive information the data set owner did not intend to release. We define, and rigorously prove, conditions that must be maintained to precisely capture the knowledge attainable by the adversary.
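To make the notion of anonymization-as-transformation concrete, the following is a minimal illustrative sketch of generalization-based k-anonymity, a classic technique in this area. The record fields, the value of k, and the generalization rules (age coarsened to a decade range, ZIP code truncated to a prefix) are hypothetical and not taken from this work.

```python
# Illustrative sketch (not the author's method): generalization-based
# k-anonymity. Quasi-identifiers are coarsened until every combination
# of them matches at least K records, hiding individuals in a crowd.
from collections import Counter

K = 2  # each quasi-identifier combination must cover >= K records

records = [
    {"age": 23, "zip": "47906", "disease": "flu"},
    {"age": 27, "zip": "47903", "disease": "cold"},
    {"age": 35, "zip": "47905", "disease": "flu"},
    {"age": 38, "zip": "47902", "disease": "cold"},
]

def generalize(rec):
    """Coarsen quasi-identifiers: age -> decade range, zip -> 3-digit prefix."""
    lo = (rec["age"] // 10) * 10
    return {"age": f"{lo}-{lo + 9}",
            "zip": rec["zip"][:3] + "**",
            "disease": rec["disease"]}

anonymized = [generalize(r) for r in records]

# Verify k-anonymity: every (age, zip) group has at least K members.
groups = Counter((r["age"], r["zip"]) for r in anonymized)
assert all(count >= K for count in groups.values())
```

A disk-resident version of this idea would need to form such groups without loading the whole data set into memory, which is where the parallel to database indexing arises: index structures already organize and partition large data sets on secondary storage.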