Paper information - Privacy preserving nearest neighbor search and its applications

Privacy preserving nearest neighbor search and its applications

Data mining is frequently obstructed by privacy concerns. In many cases data is distributed, and bringing the data together in one place for analysis is not possible due to privacy laws or policies. Privacy preserving data mining techniques have been developed to address this issue by providing mechanisms to mine the data while giving privacy guarantees. Private algorithms built on cryptographic techniques typically provide the strongest privacy guarantees, but they also suffer from the highest overhead because of their large computation and communication requirements. For algorithms that are naturally computationally intensive, this typically results in a secure protocol that is unusable on data sets of realistic size. Attempts have been made to alleviate this problem by improving the primitives. However, accomplishing this while maintaining general security remains elusive, which suggests that some tradeoffs will need to be made in order to make progress in this area. In this thesis we address the issue of privacy preserving nearest neighbor search, a technique which forms the kernel of many data mining applications. We present a set of novel algorithms, based on secure multiparty computation primitives, that compute the nearest neighbors of records in horizontally distributed data and provide a smooth tradeoff between security and efficiency. However, since these algorithms are quadratic in the number of secure operations, they are too computationally intensive to be usable with typical secure primitives. Thus we investigate methods which make tradeoffs that enable the nearest neighbor search protocol to be usable in practical settings, while still maintaining some guarantees on the privacy of the data. To accomplish this, we develop a novel primitive for the important secure dot product computation that makes use of a noncollaborating third party and is based on polynomial secret sharing, resulting in a speedup of 1000x to 1500x.
We also investigate methods of reducing the search space, resulting in a sub-quadratic algorithm, while minimizing the impact on accuracy and security. We validate this approach through extensive simulations. Finally, we show how this algorithm can be used in three important data mining algorithms, namely LOF outlier detection, SNN clustering, and kNN classification.
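
To illustrate why a secure dot product is enough for distance-based neighbor search, recall that the squared Euclidean distance decomposes as ||x - y||^2 = x·x + y·y - 2(x·y), so only the cross term x·y involves data from both parties. The sketch below is not the thesis's primitive, which is based on polynomial secret sharing and is not specified in this abstract. It swaps in a simpler, well-known additive-masking scheme in which a noncollaborating helper hands out correlated randomness. All names (`P`, `helper_setup`, `shared_dot`) are hypothetical, the vectors are assumed to hold small non-negative integers, and the three parties are simulated in a single process.

```python
import secrets

# Assumption: all values are integers in the field Z_P, with P a large prime.
P = 2**61 - 1


def rand_vec(n):
    """Return n uniformly random field elements."""
    return [secrets.randbelow(P) for _ in range(n)]


def dot(a, b):
    """Dot product modulo P."""
    return sum(x * y for x, y in zip(a, b)) % P


def helper_setup(n):
    """Noncollaborating helper: random Ra, Rb and scalars with ra + rb = Ra.Rb (mod P).

    The helper never sees x or y; it only hands out correlated randomness.
    """
    Ra, Rb = rand_vec(n), rand_vec(n)
    ra = secrets.randbelow(P)
    rb = (dot(Ra, Rb) - ra) % P
    return (Ra, ra), (Rb, rb)


def shared_dot(x, y):
    """Simulate Alice (holds x) and Bob (holds y) in one process.

    Returns additive shares (u, v) with u + v = x.y (mod P).
    """
    (Ra, ra), (Rb, rb) = helper_setup(len(x))
    x_masked = [(xi + r) % P for xi, r in zip(x, Ra)]  # sent Alice -> Bob
    y_masked = [(yi + r) % P for yi, r in zip(y, Rb)]  # sent Bob -> Alice
    v = secrets.randbelow(P)                           # Bob's random share
    t = (dot(x_masked, y) + rb - v) % P                # sent Bob -> Alice
    u = (t - dot(Ra, y_masked) + ra) % P               # Alice's share
    return u, v


def squared_distance(x, y):
    """Combine local terms x.x and y.y with the shared cross term x.y."""
    u, v = shared_dot(x, y)
    return (dot(x, x) + dot(y, y) - 2 * (u + v)) % P
```

In this sketch the shares are recombined in the clear only to check the arithmetic. A real protocol would keep the cross term shared and feed the shares into a secure comparison, and computing all pairwise distances this way is what makes the naive search quadratic in the number of secure operations.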

Yongdae Kim | Vipin Kumar | Mark Shaneck