Noise Addition for Protecting Privacy in Data Mining

In recent years, advances in technology have facilitated the collection and storage of vast amounts of data. Many organizations, including businesses large and small, hospitals and government bodies, rely on data for day-to-day operations as well as for marketing, planning and research purposes. Examples include criminal records used by law enforcement and national security agencies, medical records used for treatment and research, and shopping records used for marketing and for refining business strategies. The benefits of the information extracted from such data can hardly be overestimated. For example, we are all witnessing the enormous progress made by the Human Genome Project, bringing new promises of previously unimaginable treatments such as gene therapy.

However, alongside this data explosion there is rising anxiety about the confidentiality of sensitive personal information and its potential misuse. This concern is not limited to data as sensitive as the medical and genetic records mentioned above. Other personal information, although not as vulnerable as health records, is also considered confidential and is likewise open to malicious exploitation. For example, detailed credit card records can be used to monitor personal habits. The IBM Multinational Consumer Privacy Survey, conducted in 1999 in Germany, the USA and the UK, illustrates public concern about privacy [6]. Most consumers (80%) feel that “consumers have lost all control over how personal information is collected and used by companies,” and the majority (94%) are concerned about the possible misuse of their personal information. The survey also shows that, when it comes to confidence that their personal information is properly handled, consumers trust health care providers and banks the most, and credit card agencies and Internet companies the least. Personal data are typically collected with the consent (presumed or otherwise) of the subject.
It seems that the main public concern stems from the so-called secondary use of personal information without the consent of the subject, that is, any use other than the one for which the data were originally collected. In other words, consumers feel strongly that their personal information should not be sold to other organizations without their prior consent. Indeed, the above-mentioned survey shows that over 50% of respondents have asked a company not to sell their information. The main concern of collectors and owners of personal records is that public apprehension about privacy may make it difficult to obtain truthful information from individuals. Privacy concerns may also lead to future laws and regulations that restrict and constrain such data collection.

In this paper we argue that it is possible to provide confidentiality of individual records while preserving the usefulness of the data for research and planning purposes. We first note that removing names and other unique identifiers is not enough to ensure the confidentiality of personal records: privacy invasion is possible whenever a record can be uniquely identified by a combination of its other attributes. For example, an individual who is the only one with certain characteristics, say an age of 25 and a salary of 45000, may be uniquely identified, and all the remaining characteristics in that record may then be learned. Better techniques are therefore needed to ensure privacy, and much work has already been done in the area of statistical databases (see, for example, [10,12,13]). In this paper we focus on records used for building decision trees, and we develop techniques for adding noise to the class and other attributes using various probability distributions. We show that decision trees built from the perturbed records are the same as, or very similar to, the trees built from the original records.
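The re-identification risk described above can be illustrated with a small sketch. The table and attribute names below are hypothetical, invented purely for illustration; the point is that even after names are removed, an attacker who knows an individual's age and salary can single out that person's record and read off its sensitive attributes.

```python
# Hypothetical de-identified table (names already removed). The attribute
# names and values are illustrative only.
records = [
    {"age": 25, "salary": 45000, "diagnosis": "diabetes"},
    {"age": 31, "salary": 45000, "diagnosis": "none"},
    {"age": 25, "salary": 52000, "diagnosis": "asthma"},
]

def matching_records(records, **known):
    """Return the records consistent with an attacker's partial knowledge."""
    return [r for r in records
            if all(r.get(k) == v for k, v in known.items())]

# Knowing only age and salary is enough to isolate one record
# and thereby learn its sensitive attribute.
hits = matching_records(records, age=25, salary=45000)
print(len(hits), hits[0]["diagnosis"])  # 1 diabetes
```

When the combination of known attributes matches exactly one record, anonymity is lost even though no identifier was ever stored.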
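As a rough sketch of the kind of perturbation discussed here (not the paper's specific mechanisms, whose details and parameters are given later), one can add continuous noise to numeric attributes and randomly reassign the class label with some small probability. The function name, noise scale `sigma` and flip probability `flip_prob` below are illustrative assumptions:

```python
import random

def perturb_record(record, numeric_keys, class_key, class_values,
                   sigma=2.0, flip_prob=0.1, rng=random):
    """Return a noisy copy of a record: Gaussian noise is added to each
    numeric attribute, and with probability flip_prob the class label is
    replaced by a value drawn uniformly from class_values."""
    noisy = dict(record)
    for k in numeric_keys:
        noisy[k] = record[k] + rng.gauss(0.0, sigma)
    if rng.random() < flip_prob:
        noisy[class_key] = rng.choice(class_values)
    return noisy

rng = random.Random(42)  # fixed seed for reproducibility
rec = {"age": 25, "salary": 45000, "buys": "yes"}
noisy = perturb_record(rec, ["age", "salary"], "buys", ["yes", "no"], rng=rng)
print(noisy)
```

A decision-tree learner trained on many records perturbed this way can still recover the dominant splits, because the noise averages out across records while individual values are masked.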