Paper Information - Automated Data Discovery in Similarity Score Queries

Automated Data Discovery in Similarity Score Queries

A vast amount of information is stored in scientific databases on the web. The dynamic nature of scientific data, the cost of providing an up-to-date snapshot of the whole database, and proprietary considerations compel database owners to hide the original data behind search interfaces. The information is often provided to researchers only through similarity-search query interfaces, which prevents a proper and focused analysis of the data. In this study, we present systematic methods of data discovery through similarity-score queries in such "uncooperative" databases. The methods are generalized to multidimensional data and to L_p-norm distance functions. The accuracy and performance of our methods are demonstrated on synthetic and real-life datasets. The methods developed in this study enable scientists to obtain the data within the range of their research interests, overcoming the limitations of the similarity-search interface. The results of this study also have implications for data privacy and security, where discovery of the original data is not desired.
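To make the idea of discovery through similarity scores concrete, here is a minimal sketch under strong assumptions. It is not the method from the paper. It assumes the interface returns the exact Euclidean (L_2) distance between a query and a single hidden record, and that arbitrary query points are allowed. The names `make_oracle` and `discover_point` are hypothetical. Probing the origin and each unit vector gives d + 1 scores, and the identity |x - e_i|^2 = |x|^2 - 2 x_i + 1 yields each coordinate directly.

```python
import math


def make_oracle(hidden_point):
    """Simulate a search interface that reveals only a distance score.

    Assumption: the score is the exact Euclidean distance to one hidden record.
    """
    def score(query):
        return math.dist(query, hidden_point)
    return score


def discover_point(score, dim):
    """Recover the hidden point from dim + 1 distance-score queries."""
    origin = [0.0] * dim
    d0_sq = score(origin) ** 2          # |x|^2
    recovered = []
    for i in range(dim):
        probe = [0.0] * dim
        probe[i] = 1.0                  # unit vector e_i
        di_sq = score(probe) ** 2       # |x - e_i|^2 = |x|^2 - 2*x_i + 1
        recovered.append((d0_sq - di_sq + 1.0) / 2.0)
    return recovered


hidden = [0.25, -1.5, 3.0]
oracle = make_oracle(hidden)
estimate = discover_point(oracle, dim=len(hidden))
print(estimate)
```

A real interface would typically return scores for several records at once, round or transform the scores, and restrict the query space, so this sketch only illustrates why exposing raw distance scores can leak the underlying data.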

Ahmet Sacan | Hakan Ferhatosmanoglu | Ali Saman Tosun | Fatih Altiparmak
