Biperpedia: An Ontology for Search Applications

Search engines make significant efforts to recognize queries that can be answered by structured data, and they invest heavily in creating and maintaining high-precision databases. While these databases have relatively wide coverage of entities, the number of attributes they model (e.g., GDP, CAPITAL, ANTHEM) is relatively small. Extending the set of attributes known to the search engine can enable it to answer queries from the long and heavy tail more precisely, extract a broader range of facts from the Web, and recover the semantics of tables on the Web. We describe Biperpedia, an ontology with 1.6M (class, attribute) pairs and 67K distinct attribute names. Biperpedia extracts attributes from the query stream and then uses the best extractions to seed attribute extraction from text. For every attribute, Biperpedia saves a set of synonyms and the text patterns in which it appears, thereby enabling it to recognize the attribute in more contexts. In addition to a detailed analysis of the quality of Biperpedia, we show that it can increase the number of Web tables whose semantics we can recover by more than a factor of 4 compared with Freebase.
