Automated Data Discovery in Similarity Score Queries

A vast amount of information is stored in scientific databases on the web. The dynamic nature of scientific data, the cost of providing an up-to-date snapshot of the whole database, and proprietary considerations compel database owners to hide the original data behind search interfaces. The information is often available to researchers only through similarity-search query interfaces, a restriction that limits proper and focused analysis of the data. In this study, we present systematic methods of data discovery through similarity-score queries in such "uncooperative" databases. The methods are generalized to multidimensional data and to L_p-norm distance functions. The accuracy and performance of our methods are demonstrated on synthetic and real-life datasets. The methods developed in this study enable scientists to obtain the data within the range of their research interests, overcoming the limitations of the similarity-search interface. The results also have implications for data privacy and security, where discovery of the original data is not desired.
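To make the idea of discovery through similarity scores concrete, the following is a minimal sketch and not the method developed in the study. It assumes a hypothetical interface (`make_oracle`) that returns the exact Euclidean (L_2) distance between a query and a single hidden record; the function names, the probe points, and the example record are all illustrative assumptions. Under those assumptions, d + 1 queries suffice to recover a d-dimensional record, because the squared distances to the origin and to each unit vector differ by a term that is linear in one coordinate.

```python
import math

def make_oracle(hidden):
    """Hypothetical similarity interface: exposes only the Euclidean
    distance between a query point and one hidden record."""
    def score(query):
        return math.dist(query, hidden)
    return score

def recover_point(score, dim):
    """Recover the hidden record from dim + 1 distance queries
    (an illustrative assumption, valid for exact L2 scores only)."""
    origin = [0.0] * dim
    d0_sq = score(origin) ** 2          # |x|^2
    point = []
    for i in range(dim):
        probe = origin.copy()
        probe[i] = 1.0                  # unit vector e_i
        di_sq = score(probe) ** 2       # |x - e_i|^2
        # |x - e_i|^2 = |x|^2 - 2*x_i + 1  =>  x_i = (|x|^2 + 1 - |x - e_i|^2) / 2
        point.append((d0_sq + 1.0 - di_sq) / 2.0)
    return point

hidden = [0.3, -1.2, 2.5]               # made-up record for illustration
estimate = recover_point(make_oracle(hidden), dim=3)
```

With noisy or rounded scores, other L_p norms, or interfaces that return ranked results for many records, this closed form no longer applies directly, which is the harder setting the abstract refers to.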
