Techniques for protecting data from piracy, illegal inference, and malicious intrusions

This thesis reports a systematic study of data protection in three different, yet related, scenarios.

The first scenario guards relational data against unauthorized redistribution, or data piracy. Here the data are released to users who have full control over their use but are restricted from any redistribution. A novel technique is presented for embedding fingerprints in database relations so that the original recipient of the data can be identified. In our scheme, a single secret key (with or without primary-key attributes in the database relations) determines how a fingerprint is embedded. Rigorous analysis shows that, with high probability, a detected fingerprint is indeed the fingerprint originally embedded, and that embedded fingerprints cannot be modified or erased by a variety of attacks, including bit flipping, tuple addition and deletion, secret-key guessing, and collusion among multiple recipients of the same relation. In addition, as a special case of fingerprinting, a robust watermarking scheme is also presented for relational databases.

The second scenario protects the privacy of individual data values against interval-based inference from aggregation queries. Unlike the first scenario, in which the data are released to users, here the data remain under the control of their owner, and users may access them only through aggregation queries. An individual data value is said to be compromised if a sufficiently accurate interval, called an inference interval, can be derived from aggregation results such that the actual data value must fall into that interval. Our study shows that auditing interval-based inference is intractable for bounded integer values, whereas for bounded real values the auditing problem has polynomial time complexity, involving mathematical programming with a large number of constraints and/or variables.

The last scenario detects anomalies in usage log data.
Unlike the first scenario, in which the data are released to users, and the second, in which the data may be queried by users, this scenario uses log data to build profiles of users' normal behavior and to detect suspicious anomalies that deviate from those profiles. Experiments show that our proposed method is more flexible and precise than previous methods that either use no time information or simply use a fixed partition of time intervals in profiling. (Abstract shortened by UMI.)
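The fingerprinting idea of the first scenario, where a secret key and a tuple's primary key decide where and how marks are embedded, can be illustrated with a minimal sketch. This is not the thesis's actual scheme; the function name, the `gamma` selection parameter, and the use of HMAC-SHA256 are all illustrative assumptions.

```python
import hmac
import hashlib

def embed_fingerprint(tuples, secret_key, fingerprint_bits, gamma=10):
    """Illustrative sketch (not the thesis's scheme): mark roughly
    1/gamma of the tuples by overwriting one low-order bit of a numeric
    attribute, where the tuple selection, the bit position, and the
    fingerprint bit to embed are all derived from a keyed hash of the
    tuple's primary key. `tuples` is a list of (primary_key, value)."""
    marked = []
    for pk, value in tuples:
        digest = hmac.new(secret_key, str(pk).encode(), hashlib.sha256).digest()
        if digest[0] % gamma == 0:                 # select ~1/gamma of tuples
            bit_index = digest[1] % 2              # which low-order bit to mark
            fp_bit = fingerprint_bits[digest[2] % len(fingerprint_bits)]
            value = (value & ~(1 << bit_index)) | (fp_bit << bit_index)
        marked.append((pk, value))
    return marked
```

Because the keyed hash is deterministic, detection can recompute the same selections from the secret key and read the embedded bits back; taking a majority vote over the many tuples carrying each fingerprint bit is what would give robustness against a limited number of bit flips or tuple deletions.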
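For the second scenario, how aggregation results can pin an individual value into an inference interval can be shown with a toy auditor. The sketch below only propagates interval bounds through exact SUM queries until a fixed point; it is sound but not complete, whereas the thesis's auditing for bounded reals relies on full mathematical programming. All names here are hypothetical.

```python
def inference_interval(bounds, sums, target):
    """Toy interval auditor (illustrative only): `bounds[i]` is the
    a-priori (lo, hi) range of value i, `sums` is a list of
    (index_list, total) pairs for exact SUM query results, and the
    function returns the tightest interval this simple bound
    propagation can derive for value `target`."""
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    changed = True
    while changed:
        changed = False
        for idxs, total in sums:
            for i in idxs:
                # x_i = total - sum(others); bound it via the others' intervals
                new_lo = total - sum(hi[j] for j in idxs if j != i)
                new_hi = total - sum(lo[j] for j in idxs if j != i)
                if new_lo > lo[i]:
                    lo[i], changed = new_lo, True
                if new_hi < hi[i]:
                    hi[i], changed = new_hi, True
    return lo[target], hi[target]
```

For example, with three values in [0, 100] and answered queries x0 + x1 = 110 and x1 + x2 = 50, the second query forces x1 into [0, 50], so the first forces x0 into [60, 100]: an inference interval of width 40 for x0, even though no query asked for x0 alone.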
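The third scenario contrasts the proposed time-aware profiling with baselines that ignore time or use a fixed partition of time intervals. As a point of reference only, the fixed-partition baseline could look roughly like the following; the thesis's own, more flexible method is not detailed in this abstract, and all names and thresholds below are hypothetical.

```python
from collections import Counter

def build_profile(log):
    """Baseline profile with a fixed partition of the day into hourly
    bins: estimate how often each (hour_of_day, event) pair occurs in
    the training log. `log` is a list of (hour, event) records."""
    counts = Counter((hour % 24, event) for hour, event in log)
    total = len(log)
    return {key: c / total for key, c in counts.items()}

def is_anomalous(profile, hour, event, threshold=0.01):
    """Flag an observation as anomalous if the event is rare, or was
    never seen, in that time bin of the normal-behavior profile."""
    return profile.get((hour % 24, event), 0.0) < threshold
```

The weakness this baseline exhibits, and which motivates a more flexible treatment of time, is that a rigid hourly partition splits behavior that straddles a bin boundary and lumps together behavior that merely happens to share a bin.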