Integrating Deep Web Data Sources

A large number of data sources on the Web (e.g., Amazon.com) are accessible only through their query interfaces. These sources are commonly known as Deep Web sources. For any domain of interest, there may be many such sources with varied query capabilities and content coverage. As a result, users frequently need to access multiple sources to find the desired information, which can be a very time-consuming and labor-intensive process. An effective solution to this problem is to build a virtual integration system over the sources. Such a system provides uniform access to the sources, thus freeing users from the details of individual sources. As an important step towards this goal, this dissertation studies the problem of integrating the query interfaces of Deep Web sources. Interface integration typically involves three very challenging tasks: (1) schema extraction, which infers the schema of each source query interface from its (HTML) representation; (2) schema matching, which accurately identifies semantic mappings among the attributes of different interfaces; and (3) schema merging, which merges the source interfaces into a well-formed global interface based on the identified attribute mappings. This dissertation presents IceQ, a novel and effective interface integration system. In developing IceQ, we address the limitations of existing solutions and make several key contributions. First, we propose a hierarchical modeling of interfaces and develop a novel spatial clustering algorithm to extract the hierarchical schema of a query interface. Second, we develop a novel interactive clustering-based matching algorithm to accurately match a large number of schemas and effectively resolve uncertain mappings via user interaction. Third, we develop a question-answering technique to learn attribute instances from the Web to assist in schema matching. Fourth, we propose a novel constraint-based optimization framework for merging schemas and develop an effective merging algorithm based on the idea of clustering aggregation. Extensive experiments have been conducted to evaluate IceQ, and the results show that it is highly effective.
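To make the schema matching task more concrete, the following is a minimal sketch of clustering-based attribute matching. It is not the IceQ algorithm: it assumes that attributes are compared only by the Dice coefficient over their label tokens, that clusters are merged greedily by average-link similarity, and that two attributes from the same interface never belong to the same mapping. The names `match_attributes`, `cluster_similarity`, and `same_interface`, the sample labels, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
from itertools import combinations


def tokens(label):
    """Lowercase a label and split it into a set of word tokens."""
    return set(label.lower().replace("-", " ").split())


def dice(a, b):
    """Dice coefficient between two token sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster_similarity(c1, c2):
    """Average-link label similarity between two clusters of (interface, label) pairs."""
    sims = [dice(tokens(l1), tokens(l2)) for _, l1 in c1 for _, l2 in c2]
    return sum(sims) / len(sims)


def same_interface(c1, c2):
    """True if the two clusters contain attributes from a common interface."""
    return bool({i for i, _ in c1} & {i for i, _ in c2})


def match_attributes(interfaces, threshold=0.5):
    """Greedily merge the most similar pair of clusters until none reaches the threshold.

    `interfaces` maps an interface name to its list of attribute labels.
    Each returned cluster is a list of (interface, label) pairs treated as one mapping.
    """
    # Start with one singleton cluster per attribute.
    clusters = [[(name, label)] for name, labels in interfaces.items() for label in labels]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Assumed constraint: attributes of the same interface are never matched.
            if same_interface(clusters[i], clusters[j]):
                continue
            s = cluster_similarity(clusters[i], clusters[j])
            if s >= threshold and (best is None or s > best[0]):
                best = (s, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected


if __name__ == "__main__":
    # Hypothetical book-search interfaces.
    sample = {
        "A": ["Title", "Author name", "Price"],
        "B": ["Book title", "Author", "Publisher"],
    }
    for cluster in match_attributes(sample):
        print(cluster)
```

In this sketch, pairs whose similarity falls below the threshold are simply left unmatched. An interactive variant could instead present such borderline pairs to the user for confirmation, in the spirit of the user interaction described above.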
