Efficient Computation of Statistical Significance of Query Results in Databases

Queries such as database similarity searches return results satisfying certain properties of distances or scores. For domain scientists, the absolute values of scores are seldom sufficient. Statistical significance or p-valueof the result is a more useful criterion. This can be computed using an appropriate model of random objects. The problem of computing p-values becomes more acute when queries have multiple components. In this case, the returned score is an aggregate of individual scores. The simple way of calculating the p-value by enumerating all random possibilities fails for large database and query sizes. We propose an efficient method to calculate the approximate p-value of a multi-attribute result when the distribution of scores for the database objects is non-parametric. Experimental evaluation on large databases shows that our method is practical, runs 5 orders of magnitude faster than the basic approach, and has an error of less than 5% in p-value computation.