Probe, cluster, and discover: focused extraction of QA-Pagelets from the deep Web

We introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped into distinct clusters of structurally similar pages. In the second phase, the pages in each cluster are examined by a subtree filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.
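The two-phase idea can be illustrated with a minimal Python sketch. It is not THOR's actual algorithm: the tag-count vectors, the greedy cosine-threshold clustering, the use of root-to-node tag paths as a crude stand-in for subtrees, and all names (`PathCollector`, `cluster_pages`, `candidate_regions`, the `threshold` value, the sample pages) are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser
import math


class PathCollector(HTMLParser):
    """Records tag counts and the text found under each root-to-node tag path."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []                       # currently open tags
        self.tags = Counter()                 # structural signature of the page
        self.text_by_path = defaultdict(str)  # tag path -> concatenated text

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates sloppy HTML).
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.text_by_path["/".join(self.stack)] += text + " "


def parse(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_pages(pages, threshold=0.9):
    """Phase 1 (assumed variant): greedily group pages with similar tag counts."""
    clusters = []
    for page in pages:
        for cluster in clusters:
            if cosine(page.tags, cluster[0].tags) >= threshold:
                cluster.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def candidate_regions(cluster):
    """Phase 2 (assumed variant): keep paths shared by every page in the
    cluster whose text differs between pages, i.e. drop static template text."""
    shared = set.intersection(*(set(p.text_by_path) for p in cluster))
    return sorted(path for path in shared
                  if len({p.text_by_path[path] for p in cluster}) > 1)


# Hypothetical responses from one site: two result pages and one "no match" page.
result_a = "<html><body><div><h1>Shop</h1></div><table><tr><td>red bike</td></tr></table></body></html>"
result_b = "<html><body><div><h1>Shop</h1></div><table><tr><td>blue kite</td></tr></table></body></html>"
no_match = "<html><body><div><h1>Shop</h1></div><p>No results</p></body></html>"

clusters = cluster_pages([parse(h) for h in (result_a, result_b, no_match)])
print(candidate_regions(clusters[0]))
```

In this toy setup the two result pages fall into one cluster and the "no match" page into another, and only the table cell path survives filtering because its text changes from page to page while the header text does not.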
