Paper information - Probe, cluster, and discover: focused extraction of QA-Pagelets from the deep Web

We introduce the concept of a QA-Pagelet: the content region of a dynamic page that contains the query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A distinguishing feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped into distinct clusters of structurally similar pages. In the second phase, the pages in each cluster are examined by a subtree filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.
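The two-phase idea in the abstract can be illustrated with a small sketch. This is not THOR's implementation: the tag-count signature, the cosine threshold, the greedy clustering, and every name below (`PathCollector`, `cluster_pages`, `find_dynamic_subtrees`) are assumptions chosen for brevity. Phase one groups pages whose tag-frequency vectors are similar; phase two looks, within one cluster, for subtrees whose text changes from page to page while the surrounding template stays fixed.

```python
import math
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


class PathCollector(HTMLParser):
    """Collects a tag-count signature and the text found under each tag path.

    Simplification (assumption): same-tag siblings share one path.
    """

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags
        self.tags = Counter()  # tag name -> number of occurrences
        self.text = {}         # "html/body/ul" -> text found below that path

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        # Credit the text to every enclosing subtree.
        for depth in range(1, len(self.stack) + 1):
            path = "/".join(self.stack[:depth])
            self.text[path] = self.text.get(path, "") + " " + data


def parse(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_pages(pages, threshold=0.8):
    """Phase 1 (toy): greedy clustering on tag-count similarity."""
    clusters = []
    for page in pages:
        for cluster in clusters:
            if cosine(page.tags, cluster[0].tags) >= threshold:
                cluster.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def find_dynamic_subtrees(cluster):
    """Phase 2 (toy): shallowest subtrees whose text varies across the cluster
    and that contain no unchanging (template) descendant."""
    common = set.intersection(*(set(p.text) for p in cluster))
    dynamic = {q for q in common if len({p.text[q] for p in cluster}) > 1}
    static = common - dynamic
    pure = {d for d in dynamic
            if not any(s.startswith(d + "/") for s in static)}
    return sorted(d for d in pure
                  if not any(d.startswith(o + "/") for o in pure))


# Hypothetical pages: two result pages and one "no results" page.
PAGES = [
    "<html><body><h1>Book search</h1><ul><li>Dune</li><li>Emma</li></ul>"
    "<p>Contact us</p></body></html>",
    "<html><body><h1>Book search</h1><ul><li>Ulysses</li></ul>"
    "<p>Contact us</p></body></html>",
    "<html><body><h1>Book search</h1><div><span>No results</span>"
    "<a>Help</a><a>Home</a></div></body></html>",
]

clusters = cluster_pages([parse(p) for p in PAGES])
for cluster in clusters:
    print(len(cluster), find_dynamic_subtrees(cluster))
```

On these toy pages the two result pages land in one cluster and the only subtree reported for it is `html/body/ul`, the list holding the query matches; the "no results" page forms its own cluster and yields no candidate. A real system would need a richer page signature and a proper clustering algorithm, and would have to distinguish same-tag siblings.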
