Privacy preserving data mining for medical data

Privacy has always been a major concern for patients and medical service providers. As a result of recent advances in information technology and the government's push for Electronic Health Record (EHR) systems, a large amount of data is now collected and stored electronically. This data is a rich and important source for research and needs to be made available for mining, while at the same time patient privacy must be preserved. In the United States, the management of medical data is heavily regulated by the Health Insurance Portability and Accountability Act (HIPAA). This strict oversight, together with the inherent characteristics of medical data, makes Privacy Preserving Medical Data Mining a special field within Privacy Preserving Data Mining (PPDM). Yet research in this field is quite limited. This study identifies the following gaps in current research:

1. Privacy protection in the medical field means protecting individuals from being associated with undesirable conditions, diagnoses or treatments (sensitive attributes). Most existing research considers only datasets with a single sensitive attribute, whereas most medical datasets contain several (e.g., the site, stage and histology of a cancer). As a result, some well-known privacy protection models, such as L-diversity, cannot be applied directly to such datasets.

2. Although medical researchers often describe their research plans when they request anonymized data, most existing PPDM methods do not use this information when de-identifying the data. As a result, the anonymized data may not be very useful for the planned mining task.

This study investigates utility-based privacy protection techniques to address these problems. Our goal is to improve the utility of the anonymized data for the statistical analyses that are frequently used in medical research, such as linear and logistic regression, proportional hazards models and classification. Our technique improves a popular privacy protection method called condensation, so that the de-identified datasets retain more utility while privacy in the transformed data is still preserved. Our methods are tested and validated on real cancer surveillance data provided by the Kentucky Cancer Registry.
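
For readers unfamiliar with condensation, the following is a minimal sketch of the basic idea, not the improved method developed in this study. It assumes purely numeric attributes and the commonly described form of the technique: records are partitioned into groups of at least k, only each group's mean and covariance are kept, and synthetic records are drawn to match those statistics. The function name `condense`, the greedy nearest-neighbour grouping, and the parameter names are illustrative assumptions.

```python
import numpy as np


def condense(records, k, seed=0):
    """Return (synthetic_records, groups) for a numeric table.

    Hypothetical helper: a sketch of basic condensation, assuming
    numeric attributes only and a simple greedy grouping strategy.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(records, dtype=float)
    n, d = X.shape
    if k < 2 or n < k:
        raise ValueError("need k >= 2 and at least k records")

    # Step 1: greedily form groups of exactly k mutually close records.
    remaining = list(range(n))
    groups = []
    while len(remaining) >= k:
        pivot = remaining[int(rng.integers(len(remaining)))]
        dists = np.linalg.norm(X[remaining] - X[pivot], axis=1)
        group = [remaining[i] for i in np.argsort(dists)[:k]]
        groups.append(group)
        chosen = set(group)
        remaining = [i for i in remaining if i not in chosen]

    # Fewer than k leftovers: attach each to the group with the nearest centroid.
    for i in remaining:
        centroids = np.array([X[g].mean(axis=0) for g in groups])
        j = int(np.argmin(np.linalg.norm(centroids - X[i], axis=1)))
        groups[j].append(i)

    # Step 2: replace each group by synthetic records that match its
    # mean and covariance in expectation.
    synthetic = np.empty_like(X)
    for g in groups:
        mean = X[g].mean(axis=0)
        cov = np.atleast_2d(np.cov(X[g], rowvar=False, bias=True))
        eigvals, eigvecs = np.linalg.eigh(cov)
        eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negatives
        # A uniform variable on [-a, a] has variance a^2 / 3, so a = sqrt(3 * eigenvalue).
        half_range = np.sqrt(3.0 * eigvals)
        coords = rng.uniform(-1.0, 1.0, size=(len(g), d)) * half_range
        synthetic[g] = mean + coords @ eigvecs.T
    return synthetic, groups
```

Because only group-level statistics survive, no synthetic record corresponds to a single original individual, while analyses that depend on means and covariances (such as linear regression) remain approximately valid. Larger values of k strengthen privacy at the cost of utility.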