Paper Information - Classification-Aware Hidden-Web Text Database Selection

Classification-Aware Hidden-Web Text Database Selection

Many valuable text databases on the web have noncrawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over multiple such "hidden-web" text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. Our algorithm is the first to construct ...

In this paper, we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC dataset shows that we achieve an order-of-magnitude improvement compared to the current baseline approaches.
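The idea of retrieving top-ranked documents through conjunctive queries can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes a toy scoring function (the sum of the weights of the query terms a document contains, which is monotonic in term presence), a hypothetical in-memory `boolean_search` standing in for the Boolean source, and simple enumeration of all term subsets. Queries are issued in decreasing order of their score lower bound, and the loop stops once no unseen document could still enter the top k.

```python
from itertools import combinations


def boolean_search(corpus, terms):
    """Hypothetical Boolean source: ids of documents containing every term."""
    return {doc_id for doc_id, words in corpus.items() if terms <= words}


def score(words, weights):
    """Toy monotonic score (an assumption): sum of weights of terms present."""
    return sum(w for term, w in weights.items() if term in words)


def rank(seen):
    """Order retrieved documents by descending score, then by id."""
    return sorted(seen.items(), key=lambda kv: (-kv[1], kv[0]))


def top_k(corpus, weights, k):
    """Return (top-k list of (doc_id, score), number of queries issued).

    Assumes all weights are positive.
    """
    terms = list(weights)
    subsets = [
        frozenset(combo)
        for size in range(1, len(terms) + 1)
        for combo in combinations(terms, size)
    ]
    # A document matching exactly the term set S scores sum(S), and it is
    # returned by the conjunctive query S, so process larger sums first.
    subsets.sort(key=lambda s: sum(weights[t] for t in s), reverse=True)
    bounds = [sum(weights[t] for t in s) for s in subsets]

    seen = {}
    issued = 0
    for i, subset in enumerate(subsets):
        for doc_id in boolean_search(corpus, subset):
            # The client scores each retrieved document itself.
            seen[doc_id] = score(corpus[doc_id], weights)
        issued += 1
        # Any unseen document can score at most the next subset's sum.
        next_bound = bounds[i + 1] if i + 1 < len(bounds) else 0
        ranked = rank(seen)
        if len(ranked) >= k and ranked[k - 1][1] >= next_bound:
            break
    return rank(seen)[:k], issued


corpus = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a"}, 4: {"c"}, 5: {"x"}}
weights = {"a": 3, "b": 2, "c": 1}
results, issued = top_k(corpus, weights, k=2)
print(results, issued)
```

On this toy corpus the top two documents are found after two of the seven possible conjunctive queries. Enumerating every subset is exponential in the number of query terms, so the sketch only conveys the early-termination argument for a monotonic ranking function.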

P. Kavitha
