Classification-Aware Hidden-Web Text Database Selection

Many valuable text databases on the web have noncrawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over multiple such "hidden-web" text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. Our algorithm is the first to construct [...] In this paper, we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface that has no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC dataset shows that we achieve an order-of-magnitude improvement over the current baseline approaches.
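To make the idea of ranking on top of a Boolean interface concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a toy in-memory source (`BooleanSource`, hypothetical) that answers only conjunctive keyword queries, and a deliberately simple monotonic score: the sum of assumed per-term weights over the query terms a document contains. The hypothetical `top_k` function issues conjunctive queries in decreasing order of their total weight and stops once no unseen document could enter the top k.

```python
from itertools import combinations


class BooleanSource:
    """Toy stand-in for a source that only answers conjunctive (AND) queries."""

    def __init__(self, docs):
        # docs: mapping of document id -> text
        self.docs = {d: set(text.lower().split()) for d, text in docs.items()}
        self.queries_issued = 0

    def search(self, terms):
        """Return ids of documents containing every term; no ranking is applied."""
        self.queries_issued += 1
        return {d for d, words in self.docs.items()
                if all(t in words for t in terms)}


def top_k(source, weights, k):
    """Return the k best (doc_id, score) pairs using only conjunctive queries.

    weights maps each query term to a positive weight (an assumption of this
    sketch). A document's score is the total weight of the query terms it
    contains, which is monotonic in the set of matched terms.
    """
    terms = sorted(weights)
    # Every non-empty conjunction of query terms, heaviest first.
    subsets = [c for r in range(1, len(terms) + 1)
               for c in combinations(terms, r)]
    subsets.sort(key=lambda s: (-sum(weights[t] for t in s), s))

    seen = {}  # doc_id -> exact score
    for subset in subsets:
        bound = sum(weights[t] for t in subset)
        # Any unseen document matched none of the heavier conjunctions,
        # so its score cannot exceed `bound`.
        if len(seen) >= k:
            kth_score = sorted(seen.values(), reverse=True)[k - 1]
            if kth_score >= bound:
                break
        for doc_id in source.search(subset):
            # The first (heaviest) conjunction that returns a document is
            # exactly its set of matched terms, so `bound` is its score.
            seen.setdefault(doc_id, bound)

    ranked = sorted(seen.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    source = BooleanSource({
        "d1": "boolean query ranking interface",
        "d2": "boolean query",
        "d3": "ranking",
        "d4": "unrelated text",
        "d5": "query ranking",
    })
    print(top_k(source, {"boolean": 3, "query": 1, "ranking": 2}, k=2))
    print("queries issued:", source.queries_issued)
```

This sketch enumerates all subsets of the query terms, so it is only practical for short queries; it illustrates the early-termination idea rather than an efficient query-generation strategy.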
