Query Routing: Finding Ways in the Maze of the DeepWeb

This paper presents a source selection system based on attribute co-occurrence framework for ranking and selecting Deep Web sources that provide information relevant to users requirement. Given the huge number of heterogeneous Deep Web data sources, the end users may not know the sources that can satisfy their information needs. Selecting and ranking sources in relevance to the user requirements is challenging. Our system finds appropriate sources for such users by allowing them to input just an imprecise initial query. As a key insight, we observe that the semantics and relationships between deep Web sources are self-revealing through their query interfaces, and in essence, through the co-occurrences between attributes. Based on this insight, we design a co-occurrence based attribute graph for capturing the relevances of attributes, and using them in ranking of sources in the order of relevance to user’s requirement. Further, we present an iterative algorithm that realizes our model. Our preliminary evaluation on real-world sources demonstrates the effectiveness of our approach.

[1]  Luis Gravano,et al.  Probe, count, and classify: categorizing hidden web databases , 2001, SIGMOD '01.

[2]  Jeffrey D. Ullman,et al.  Information integration using logical views , 1997, Theor. Comput. Sci..

[3]  Mitesh Patel,et al.  Structured databases on the web: observations and implications , 2004, SGMD.

[4]  Clement T. Yu,et al.  WISE-Integrator: An Automatic Integrator of Web Search Interfaces for E-Commerce , 2003, VLDB.

[5]  Martin Bergman,et al.  The deep web:surfacing the hidden value , 2000 .

[6]  Kevin Chen-Chuan Chang,et al.  Statistical schema matching across web query interfaces , 2003, SIGMOD '03.

[7]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[8]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[9]  Hector Garcia-Molina,et al.  Extracting structured data from Web pages , 2003, SIGMOD '03.

[10]  Luis Gravano,et al.  Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection , 2002, VLDB.

[11]  B. Huberman,et al.  The Deep Web : Surfacing Hidden Value , 2000 .

[12]  Sriram Raghavan,et al.  Crawling the Hidden Web , 2001, VLDB.

[13]  Kevin Chen-Chuan Chang,et al.  Understanding Web query interfaces: best-effort parsing with hidden syntax , 2004, SIGMOD '04.

[14]  Clement T. Yu,et al.  An interactive clustering-based approach to integrating source query interfaces on the deep Web , 2004, SIGMOD '04.

[15]  Luis Gravano,et al.  STARTS: Stanford Protocol Proposal for Internet Retrieval and Search , 1997 .

[16]  Alberto O. Mendelzon,et al.  Database techniques for the World-Wide Web: a survey , 1998, SGMD.