Improving record linkage with supervised learning for disclosure risk assessment

In data privacy, record linkage can be used to estimate the disclosure risk of protected data. To model the worst-case scenario, one normally attempts to link records from the original data to the protected data. In this paper we introduce a parametrization of record linkage in terms of a weighted mean and its weights, and provide a supervised learning method to determine the optimal weights for the linkage process, that is, the weights that yield the maximum number of correct links between the protected and the original data. We compare our method to standard record linkage on data produced by several protection methods widely used in statistical disclosure control, and we evaluate the results in terms of both linkage performance and computational effort.
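
As a rough illustration of what a weighted-mean parametrization of record linkage might look like, the following minimal sketch links each original record to its nearest protected record under a weighted mean of per-attribute squared differences, then reports the fraction of correct links. The abstract does not specify the distance or the learning procedure, so the squared-difference distance, the assumption that row i of the protected data is the masked version of row i of the original data, and the names `weighted_distance` and `linkage_rate` are all hypothetical choices made here. The supervised step that searches for the best weights is not shown.

```python
def weighted_distance(a, b, weights):
    """Weighted mean of per-attribute squared differences (assumed distance)."""
    total = sum(weights)
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) / total


def linkage_rate(original, protected, weights):
    """Fraction of original records whose nearest protected record is the true match.

    Assumes protected[i] is the masked version of original[i], which is the
    worst-case setting where the attacker holds the original data.
    """
    correct = 0
    for i, record in enumerate(original):
        # Link the record to the protected record at minimum weighted distance.
        nearest = min(
            range(len(protected)),
            key=lambda j: weighted_distance(record, protected[j], weights),
        )
        if nearest == i:
            correct += 1
    return correct / len(original)


# Toy data (made up for illustration): attribute 0 was shuffled by the
# protection method, attribute 1 was left untouched.
original = [[0, 0], [1, 10], [2, 20]]
protected = [[2, 0], [0, 10], [1, 20]]

rate_first_only = linkage_rate(original, protected, [1, 0])
rate_second_only = linkage_rate(original, protected, [0, 1])
```

On this toy data, putting all the weight on the untouched attribute links every record correctly, while putting it on the shuffled attribute links none, which is the kind of difference a weight-learning procedure would try to exploit.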
