Extracting and Querying a Comprehensive Web Database

Recent research in domain-independent information extraction holds the promise of an automatically constructed structured database derived from the Web. A query system based on this database would offer the same breadth as a Web search engine, but with much more sophisticated query tools than are common today. Unfortunately, these domain-independent Web extractors are usually not model-independent; e.g., an extractor that only finds binary relations in text will be blind to relational data found in tables. Because a topic area often has a data model that is a natural fit (e.g., population statistics are usually in tables, while biographical facts about Einstein are embedded in text), even a high-quality domain-independent extractor will miss a substantial amount of data. Our Omnivore system attempts to build a comprehensive Web database by running multiple domain-independent extractors in parallel over a Web crawl, then combining their outputs into a single large entity-relationship database. Each item in the database describes a single real-world entity, and can contain information drawn from a number of popular Web data models. The user can correct flaws in the database, and can query it using either a structured query language or a search-like interface. Due to the Web’s sheer size, users cannot be expected to know the result set’s metadata a priori, so Omnivore automatically chooses an output model and schema when it renders results. In this paper we outline the Omnivore architecture and provide specific details about our current prototype.
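
The extract-then-combine idea described above can be illustrated with a minimal Python sketch. It is not the Omnivore implementation: the two toy extractors, the page format, the function names, and the use of a lowercased entity name as the merge key are all assumptions made for illustration, and real entity resolution is far harder than this.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def extract_text_relations(page):
    """Hypothetical text extractor: reads 'subject | relation | object' lines."""
    facts = []
    for line in page.get("text", []):
        subject, relation, obj = line.split("|")
        facts.append((subject.strip(), relation.strip(), obj.strip()))
    return facts


def extract_table_rows(page):
    """Hypothetical table extractor: the first column names the entity."""
    facts = []
    for table in page.get("tables", []):
        header, *rows = table
        for row in rows:
            for column, value in zip(header[1:], row[1:]):
                facts.append((row[0], column, value))
    return facts


def build_database(pages, extractors):
    """Run every extractor over every page, then merge facts per entity.

    Assumption: entities are matched by lowercased name, a stand-in for
    a real entity-resolution step.
    """
    database = defaultdict(lambda: defaultdict(set))
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(extractor, page)
            for page in pages
            for extractor in extractors
        ]
        for future in futures:
            for entity, attribute, value in future.result():
                database[entity.lower()][attribute].add(value)
    return database


if __name__ == "__main__":
    pages = [
        {
            "text": ["Einstein | born in | Ulm"],
            "tables": [[["name", "field"], ["einstein", "physics"]]],
        }
    ]
    db = build_database(pages, [extract_text_relations, extract_table_rows])
    for attribute, values in db["einstein"].items():
        print(attribute, sorted(values))
```

In this sketch a single entity record ends up holding one fact that came from text and one that came from a table, which is the property the combined database is meant to have.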
