Parallel Probing of Web Databases for Top-k Query Processing

A “top-k query” specifies a set of preferred values for the attributes of a relation and expects as a result the k objects that are “closest” to the given preferences according to some distance function. In many web applications, the relation attributes are only available via probes to autonomous web-accessible sources. Probing these sources sequentially to process a top-k query is inefficient, since web accesses exhibit high and variable latency. Fortunately, web sources can be probed in parallel, and each source can typically process concurrent requests, although sources may impose some restrictions on the type and number of probes that they are willing to accept. These characteristics of web sources motivate the introduction of parallel top-k query processing strategies, which are the focus of this paper. We present efficient techniques that maximize source-access parallelism to minimize query response time, while satisfying source access constraints. A thorough experimental evaluation over both synthetic and real web sources shows that our techniques can be significantly more efficient than previously proposed sequential strategies. In addition, we adapt our parallel algorithms for the alternate optimization goal of minimizing source load while still exploiting source-access parallelism.