
OpenCeres: When Open Information Extraction Meets the Semi-Structured Web

Open Information Extraction (OpenIE), the problem of harvesting triples from natural language text whose predicate relations are not aligned to any pre-defined ontology, has been a popular subject of research for the last decade. However, this research has largely ignored the vast quantity of facts available in semi-structured webpages. In this paper, we define the problem of OpenIE from semi-structured websites to extract such facts, and present an approach for solving it. We also introduce a labeled evaluation dataset to motivate research in this area. Given a semi-structured website and a set of seed facts for some relations existing on its pages, we employ a semi-supervised label propagation technique to automatically create training data for the relations present on the site. We then use this training data to learn a classifier for relation extraction. Experimental results of this method on our new benchmark dataset obtained a precision of over 70%. A larger scale extraction experiment on 31 websites in the movie vertical resulted in the extraction of over 2 million triples.
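The abstract names label propagation from seed facts as the step that produces training data, but it does not say how the graph over page elements is built or which propagation variant is used. The following is therefore only a minimal sketch of generic iterative label propagation on a weighted graph, not the authors' method. The node names, edge weights, relation labels, and the functions `propagate_labels` and `best_labels` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, iterations=10):
    """Generic label propagation on an undirected weighted graph.

    edges: list of (node_a, node_b, weight) tuples; a hypothetical
           similarity graph over text nodes of a web page.
    seeds: dict mapping node -> relation label for the seed facts.
           Seed nodes are clamped to their label on every iteration.
    Returns a dict node -> {label: score}; scores of every labeled
    node sum to 1.
    """
    neighbors = defaultdict(list)
    for a, b, w in edges:
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))

    nodes = set(neighbors) | set(seeds)

    # Seeds start at full confidence; all other nodes start unlabeled.
    scores = {node: {} for node in nodes}
    for node, label in seeds.items():
        scores[node] = {label: 1.0}

    for _ in range(iterations):
        updated = {}
        for node in nodes:
            if node in seeds:
                updated[node] = {seeds[node]: 1.0}  # clamp the seed
                continue
            # Weighted vote over the neighbors' current label scores.
            totals = defaultdict(float)
            for other, w in neighbors[node]:
                for label, s in scores[other].items():
                    totals[label] += w * s
            norm = sum(totals.values())
            updated[node] = (
                {label: v / norm for label, v in totals.items()} if norm else {}
            )
        scores = updated
    return scores


def best_labels(scores):
    """Pick the highest-scoring label for every node that received one."""
    return {node: max(d, key=d.get) for node, d in scores.items() if d}


# Hypothetical example: two seed facts and three unlabeled text nodes.
edges = [
    ("node_director_1", "node_director_2", 1.0),
    ("node_director_2", "node_director_3", 1.0),
    ("node_genre_1", "node_genre_2", 1.0),
    ("node_director_3", "node_genre_2", 0.1),  # weak cross-link
]
seeds = {"node_director_1": "directed_by", "node_genre_1": "genre"}

scores = propagate_labels(edges, seeds)
labels = best_labels(scores)
```

In this sketch the propagated labels would serve as automatically created training examples for a downstream relation classifier, which is the second stage the abstract describes.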

Xin Dong | Colin Lockard | Prashant Shiralkar
