OpenCeres: When Open Information Extraction Meets the Semi-Structured Web

Open Information Extraction (OpenIE), the problem of harvesting from natural language text triples whose predicate relations are not aligned to any pre-defined ontology, has been a popular subject of research for the last decade. However, this research has largely ignored the vast quantity of facts available in semi-structured webpages. In this paper, we define the problem of OpenIE from semi-structured websites to extract such facts, and present an approach for solving it. We also introduce a labeled evaluation dataset to motivate research in this area. Given a semi-structured website and a set of seed facts for some relations existing on its pages, we employ a semi-supervised label propagation technique to automatically create training data for the relations present on the site. We then use this training data to learn a classifier for relation extraction. In experiments on our new benchmark dataset, this method achieved a precision of over 70%. A larger-scale extraction experiment on 31 websites in the movie vertical yielded over 2 million triples.
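
To make the label propagation step concrete, the following is a minimal, generic sketch of propagating seed labels over a weighted similarity graph. It is not the algorithm from the paper: the graph construction, the edge weights, the node names, and the function `propagate_labels` are all hypothetical and chosen only for illustration. Each node stands for a candidate field on a page, seed nodes carry a known relation label, and unlabeled nodes repeatedly take the weighted average of their neighbors' label distributions.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, num_iters=20):
    """Spread seed labels over a weighted graph.

    edges: dict mapping node -> list of (neighbor, weight) pairs
           (hypothetical similarity graph, listed in both directions).
    seeds: dict mapping node -> known label; seed labels stay fixed.
    Returns a dict mapping node -> {label: score}.
    """
    nodes = set(edges) | {nbr for nbrs in edges.values() for nbr, _ in nbrs}
    scores = {node: {label: 1.0} for node, label in seeds.items()}
    for _ in range(num_iters):
        new_scores = {}
        for node in nodes:
            if node in seeds:
                # Clamp seed nodes to their known label.
                new_scores[node] = {seeds[node]: 1.0}
                continue
            acc = defaultdict(float)
            for nbr, weight in edges.get(node, []):
                for label, score in scores.get(nbr, {}).items():
                    acc[label] += weight * score
            total = sum(acc.values())
            # Normalize so each node's scores sum to 1 (empty if unreachable).
            new_scores[node] = (
                {label: value / total for label, value in acc.items()}
                if total else {}
            )
        scores = new_scores
    return scores


# Hypothetical toy graph: two seed fields and two unlabeled fields.
edges = {
    "page1:Directed by": [("page2:Director", 0.9)],
    "page2:Director": [("page1:Directed by", 0.9), ("page2:Genres", 0.1)],
    "page1:Genre": [("page2:Genres", 0.9)],
    "page2:Genres": [("page1:Genre", 0.9), ("page2:Director", 0.1)],
}
seeds = {"page1:Directed by": "director", "page1:Genre": "genre"}

result = propagate_labels(edges, seeds)
best = {node: max(dist, key=dist.get) for node, dist in result.items() if dist}
```

In this sketch, the highest-scoring label per node (`best`) could serve as automatically generated training data for a downstream relation classifier, which is the role the abstract assigns to the propagated labels.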
