Surrogate ranking for very expensive similarity queries

We consider the problem of similarity search in applications where computing the similarity between two records is very expensive and the similarity measure is not a metric. In such applications, comparing even a tiny fraction of the database records to a single query record can be orders of magnitude slower than reading the entire database from disk, and indexing is often not possible. We develop a general-purpose statistical framework for answering top-k queries in such databases when the database administrator can supply an inexpensive surrogate ranking function that substitutes for the actual similarity measure. We present a robust method that learns the relationship between the surrogate function and the similarity measure. Given a query, we use Bayesian statistics to update the model based on the partial results observed so far. Using the updated model, we construct bounds on the accuracy of the result set obtained via the surrogate ranking. Our experiments show that our models can produce useful bounds for several real-life applications.
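The basic idea of answering a top-k query through a surrogate ranking can be illustrated with a minimal Python sketch. It is not the framework described above: the learned statistical model and the Bayesian accuracy bounds are not reproduced here. All names (`surrogate_top_k`, `recall_at_k`, `budget`) and the toy similarity and surrogate functions are assumptions introduced for illustration only.

```python
import random

def surrogate_top_k(query, records, surrogate, similarity, k, budget):
    """Rank every record with the cheap surrogate, then apply the
    expensive similarity only to the `budget` best-ranked candidates."""
    candidates = sorted(records, key=lambda r: surrogate(query, r), reverse=True)[:budget]
    rescored = sorted(candidates, key=lambda r: similarity(query, r), reverse=True)
    return rescored[:k]

def recall_at_k(approx, exact):
    """Fraction of the exact top-k that the approximate result recovered."""
    return len(set(approx) & set(exact)) / len(exact)

# Toy data (hypothetical): records are integers, and the "expensive"
# similarity is just negative distance to the query.
records = list(range(100))

def similarity(q, r):
    return -abs(q - r)

# The surrogate is the true similarity plus fixed per-record noise,
# standing in for a cheap but imperfect ranking function.
rng = random.Random(0)
noise = {r: rng.gauss(0, 3) for r in records}

def surrogate(q, r):
    return similarity(q, r) + noise[r]

query = 50.3
# With budget == len(records) every record is rescored, so this is the exact answer.
exact = surrogate_top_k(query, records, surrogate, similarity, k=5, budget=len(records))
# With a small budget, only 20 expensive comparisons are made.
approx = surrogate_top_k(query, records, surrogate, similarity, k=5, budget=20)
print("exact :", exact)
print("approx:", approx)
print("recall:", recall_at_k(approx, exact))
```

In this sketch the recall can be computed only because the exact answer is cheap to obtain on toy data. In the setting of the abstract the exact answer is unavailable, which is why bounds on the accuracy of the surrogate-ranked result set are needed.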
