A Secure Protocol for Computing String Distance Metrics

Finding matching pairs of records across heterogeneous databases while preserving the privacy of the parties that hold them is an important problem. As we have shown in earlier work, distance metrics are a useful tool for record linkage in many domains, so secure computation of distance metrics is important for secure record linkage. In this paper, we consider the computation of several distance metrics in a secure multiparty setting. Toward this goal, we propose a stochastic scalar product protocol that is provably consistent and as secure as the underlying cryptographic set-intersection protocol. We then use this stochastic scalar product protocol to securely compute standard distance metrics such as TFIDF, SoftTFIDF, and the Euclidean distance. The resulting estimates are asymptotically consistent, and experiments show that they are also quite close to the true values after just 1000 samples. These secure distance computations can then be used to perform secure matching of records.
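To illustrate how a scalar product can be estimated by sampling and counting matches, the following is a minimal sketch and not the protocol of the paper. It assumes non-negative vectors, and it compares the sampled indices in the clear, whereas a secure version would obtain the match count from a cryptographic set-intersection protocol. The function name `stochastic_dot` and its parameters are hypothetical. Each party samples indices with probability proportional to its own vector entries; the probability that two independent samples agree is the scalar product divided by the product of the two vector sums, so the rescaled match rate is a consistent estimate.

```python
import random


def stochastic_dot(x, y, n_samples=1000, seed=0):
    """Estimate sum(x[k] * y[k]) for non-negative vectors by sampling.

    Illustrative sketch only: indices are compared in the clear here,
    which a privacy-preserving protocol would not do.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("this sketch assumes non-negative entries")

    zx, zy = sum(x), sum(y)
    if zx == 0 or zy == 0:
        return 0.0  # a zero vector has scalar product 0 with anything

    rng = random.Random(seed)
    indices = list(range(len(x)))

    # Each party draws indices with probability proportional to its entries.
    xs = rng.choices(indices, weights=x, k=n_samples)
    ys = rng.choices(indices, weights=y, k=n_samples)

    # P(i == j) = sum_k (x[k] / zx) * (y[k] / zy), so rescale the match rate.
    matches = sum(1 for i, j in zip(xs, ys) if i == j)
    return zx * zy * matches / n_samples


if __name__ == "__main__":
    x = [0.5, 0.3, 0.2, 0.0]
    y = [0.4, 0.4, 0.1, 0.1]
    exact = sum(a * b for a, b in zip(x, y))
    print("exact:", exact)
    print("estimate:", stochastic_dot(x, y, n_samples=1000))
```

The estimate varies with the seed and the number of samples; its standard error shrinks in proportion to the inverse square root of `n_samples`.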
