An examination of data confidentiality and disclosure issues related to publication of empirical ROC curves.

RATIONALE AND OBJECTIVES Grant funding institutions often require organizations to share their collected data as widely as possible while safeguarding the privacy of individuals. Summaries based on these data are often released. Here, the receiver operating characteristic (ROC) curve is examined for potential statistical disclosure in the presence of auxiliary data.

MATERIALS AND METHODS Formulas are introduced for calculating the missing data points of the full data set, given that a user has an empirical ROC curve and a subset of the data used to generate that curve. The plausibility of this scenario is also discussed.

RESULTS Diagnostic test data were simulated and an ROC curve was produced. Using a subset of the true data and the points on the empirical ROC curve, an attempt was made to reproduce the missing parts of the data. Disease statuses could be determined exactly, whereas test scores could be recovered only up to their rank.

CONCLUSIONS If an individual or organization possesses the points of an empirical ROC curve and a subset of the true data, the true data underlying the ROC curve can be reproduced relatively accurately. As a result, summaries of data, including the ROC curve, must be given careful thought from a statistical disclosure perspective before they are released.
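To illustrate why disease statuses can be read off an empirical ROC curve, the following is a minimal sketch and not the formulas introduced in the paper. It assumes that the scores contain no ties, that higher scores indicate disease, and that every vertex of the curve is available to the user. Under those assumptions each vertical step corresponds to a diseased subject and each horizontal step to a non-diseased one, so the statuses are recovered in rank order. The function names `empirical_roc` and `statuses_from_roc` and the simulated data are hypothetical.

```python
import numpy as np

def empirical_roc(scores, labels):
    """Vertices (FPR, TPR) of the empirical ROC curve, one step per subject,
    walking from the highest score to the lowest. Assumes no tied scores."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = np.asarray(labels)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    return fpr, tpr

def statuses_from_roc(fpr, tpr):
    """Disease status by score rank: 1 for a vertical step (TPR increases),
    0 for a horizontal step (only FPR increases)."""
    return (np.diff(tpr) > 1e-12).astype(int)

# Simulated diagnostic test data (illustrative assumption, not the paper's data).
rng = np.random.default_rng(0)
labels = np.array([0, 1] * 10)                     # 10 non-diseased, 10 diseased
scores = rng.normal(size=labels.size) + labels     # diseased subjects score higher on average

fpr, tpr = empirical_roc(scores, labels)
recovered = statuses_from_roc(fpr, tpr)

# True statuses arranged by descending score, for comparison.
truth_by_rank = labels[np.argsort(-scores, kind="stable")]
print(np.array_equal(recovered, truth_by_rank))
```

The sketch recovers only the ordering of subjects and their statuses. Consistent with the abstract, the scores themselves are identifiable at most up to rank from the curve alone, and linking ranks to individuals would require auxiliary data such as the known subset.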
