Some methods for blindfolded record linkage

Background

The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust in the intentions of that party and their ability to maintain security and confidentiality. Dusserre, Quantin, Bouzelat and colleagues have demonstrated that it is possible to use secure one-way hash transformations to carry out follow-up epidemiological studies without any party having to reveal identifying information about any of the subjects – a technique which we refer to as "blindfolded record linkage". A limitation of their method is that only exact comparisons of values are possible, although phonetic encoding of names and other strings can be used to allow for some types of typographical variation and data errors.

Methods

A method is described which permits the calculation of a general similarity measure, the n-gram score, without having to reveal the data being compared, albeit at some cost in computation and data communication. This method can be combined with public key cryptography and automatic estimation of linkage model parameters to create an overall system for blindfolded record linkage.

Results

The system described offers good protection against misdeeds or security failures by any one party, but remains vulnerable to collusion between or simultaneous compromise of two or more parties involved in the linkage operation. In order to reduce the likelihood of this, the use of last-minute allocation of tasks to substitutable servers is proposed. Proof-of-concept computer programmes written in the Python programming language are provided to illustrate the similarity comparison protocol.

Conclusion

Although the protocols described in this paper are not unconditionally secure, they do suggest the feasibility, with the aid of modern cryptographic techniques and high speed communication networks, of a general purpose probabilistic record linkage system which permits record linkage studies to be carried out with negligible risk of invasion of personal privacy.
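
To make the n-gram score concrete, the following Python fragment is a minimal sketch and not the protocol described in the paper. It assumes that the score is the Dice coefficient over character bigrams, which is one common definition, and it uses HMAC-SHA256 with a hypothetical shared key as a stand-in for a secure one-way hash transformation. The function names and the key are illustrative assumptions. The sketch shows only that a set-overlap score is unchanged when each bigram is replaced by its keyed hash, so the score can be computed from hashed values alone.

```python
import hashlib
import hmac


def bigrams(value):
    """Return the set of adjacent character pairs in a normalised string."""
    text = value.strip().lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_score(a, b):
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def hashed_bigrams(value, shared_key):
    """Replace each bigram with its keyed one-way hash (HMAC-SHA256)."""
    return {
        hmac.new(shared_key, gram.encode("utf-8"), hashlib.sha256).hexdigest()
        for gram in bigrams(value)
    }


# Hypothetical key, assumed to be agreed between the two data custodians.
key = b"hypothetical-shared-secret"

# Score computed on the raw strings.
plain = dice_score(bigrams("peter"), bigrams("pete"))

# Score computed only from hashed bigrams; the raw strings are not compared.
blind = dice_score(hashed_bigrams("peter", key), hashed_bigrams("pete", key))

print(plain, blind)
```

Hashing each bigram separately, as done here, would leave the values open to frequency analysis and to dictionary attack by anyone holding the key, so this fragment should be read as an illustration of the score only, not as a privacy-preserving scheme in its own right.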
