Deep web search: an overview and roadmap

We review the state of the art in deep web search and propose a novel classification scheme for comparing deep web search systems. The current binary classification (surfacing versus virtual integration) hides a number of implicit decisions that a developer must make. We make these decisions explicit by distinguishing seven system aspects that describe a system in terms of its functionality (what it can and cannot do) and in terms of its solution to a specific problem. We then motivate the need for a search system with a single-field free-text query interface that supports real-time structured search over multiple sources. To this end, we discuss two possible federated architectures and state the scientific challenges they raise. Finally, we present the findings of our ongoing project and briefly outline related work on free-text interfaces over structured data.
