Paper Information - Extracting and Querying a Comprehensive Web Database

Extracting and Querying a Comprehensive Web Database

Recent research in domain-independent information extraction holds the promise of an automatically constructed structured database derived from the Web. A query system based on this database would offer the same breadth as a Web search engine, but with much more sophisticated query tools than are common today. Unfortunately, these domain-independent Web extractors are usually not model-independent; e.g., an extractor that only finds binary relations from text will be blind to relational data found in tables. Because a topic area often has a data model that is a natural fit (e.g., population statistics are usually in tables, while biographical facts about Einstein are embedded in text), even a high-quality domain-independent extractor will miss a substantial amount of data. Our omnivore system attempts to build a comprehensive Web database by running multiple domain-independent extractors in parallel over a Web crawl, then combining their outputs into a single large entity-relationship database. Each item in the database describes a single real-world entity, and can contain information drawn from a number of popular Web data models. The user can correct flaws in the database, and can query it using either a structured query language or a search-like interface. Due to the Web’s sheer size, users cannot be expected to know the result set’s metadata a priori, so omnivore automatically chooses an output model and schema when it renders results. In this paper we outline the omnivore architecture and provide specific details about our current prototype.
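The idea of combining the outputs of several extractors into one entity-centric database can be illustrated with a minimal sketch. This is not the omnivore implementation: the extractor functions, the fact format `(entity, attribute, value)`, and the name-normalization step standing in for entity resolution are all assumptions made for illustration, and "Exampletown" with its population is made-up data.

```python
from collections import defaultdict

# Hypothetical extractors: names, signatures, and outputs are illustrative only.
def text_extractor(pages):
    """Stand-in for an extractor that finds binary relations in running text."""
    return [("Albert Einstein", "born_in", "Ulm")]

def table_extractor(pages):
    """Stand-in for an extractor that recovers facts from table rows."""
    return [
        ("albert  einstein", "birth_year", "1879"),
        ("Exampletown", "population", "12345"),  # made-up entity and value
    ]

def merge_extractions(extractor_outputs):
    """Group facts from every extractor under one record per entity.

    Entities are matched by a naive normalized-name key, which is only a
    placeholder (an assumption) for a real entity-resolution step.
    """
    entities = defaultdict(lambda: defaultdict(set))
    for source, facts in extractor_outputs.items():
        for entity, attribute, value in facts:
            key = " ".join(entity.lower().split())
            # Remember which extractor produced each value.
            entities[key][attribute].add((value, source))
    return entities

pages = []  # placeholder for a Web crawl
database = merge_extractions({
    "text": text_extractor(pages),
    "table": table_extractor(pages),
})
```

In this sketch the record for `"albert einstein"` ends up holding one attribute from the text extractor and one from the table extractor, which mirrors the abstract's point that a single entity can carry information drawn from more than one data model.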

Michael J. Cafarella

[1] Gerhard Weikum,et al. Combining linguistic and statistical analysis to extract relations from web documents , 2006, KDD '06.

[2] Luis Gravano,et al. Snowball: a prototype system for extracting relations from large text collections , 2001, SIGMOD '01.

[3] Raghu Ramakrishnan,et al. Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach , 2007, VLDB.

[4] Daisy Zhe Wang,et al. Uncovering the Relational Web , 2008, WebDB.

[5] Jens Lehmann,et al. DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[6] Avrim Blum,et al. The Bottleneck , 2021, Monopsony Capitalism.

[7] Gerhard Weikum,et al. YAGO: A Core of Semantic Knowledge , 2007, WWW '07.

[8] Anastasia Ailamaki,et al. Challenges in building a DBMS Resource Advisor , 2006, IEEE Data Eng. Bull..

[9] Philip A. Bernstein,et al. ModelGen: model independent schema translation , 2005, 21st International Conference on Data Engineering (ICDE'05).

[10] Praveen Paritosh,et al. Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[11] James Fogarty,et al. Intelligence in Wikipedia , 2008, AAAI.

[12] Jayant Madhavan,et al. Reference reconciliation in complex information spaces , 2005, SIGMOD '05.

[13] Andrew McCallum,et al. Joint deduplication of multiple record types in relational data , 2005, CIKM '05.

[14] Oren Etzioni,et al. Navigating Extracted Data with Schema Discovery , 2007, WebDB.

[15] Oren Etzioni,et al. Open Information Extraction from the Web , 2007, CACM.

[16] Doug Downey,et al. Web-scale information extraction in knowitall: (preliminary results) , 2004, WWW '04.

[17] Oren Etzioni,et al. Structured Querying of Web Text Data: A Technical Challenge , 2007, CIDR.

[18] Daisy Zhe Wang,et al. WebTables: exploring the power of tables on the web , 2008, Proc. VLDB Endow..

[19] Sriram Raghavan,et al. Avatar Information Extraction System , 2006, IEEE Data Eng. Bull..

[20] Pedro M. Domingos,et al. Entity Resolution with Markov Logic , 2006, Sixth International Conference on Data Mining (ICDM'06).

[21] Xiaojin Zhu,et al. Building Community Wikipedias: A Machine-Human Partnership Approach , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[22] Philip A. Bernstein,et al. Applying Model Management to Classical Meta Data Problems , 2003, CIDR.

[23] Sergey Brin,et al. Extracting Patterns and Relations from the World Wide Web , 1998, WebDB.