CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages achieve high precision and recall only when manual annotations are available for each website. Although there have been efforts to learn extractors from automatically generated labels, these methods are not robust enough to succeed in settings with complex schemas and information-rich websites. In this paper, we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a webpage and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier on these potentially noisy and incomplete labels to predict new relation instances. In terms of extraction quality, our method can compete with annotation-based techniques in the literature. A large-scale experiment on over 400,000 pages from dozens of multilingual long-tail websites harvested 1.25 million facts at a precision of 90%.
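The label-generation step described above can be illustrated with a minimal sketch. It is not the CERES implementation: the toy knowledge base, the page representation (a mapping from XPath-like node paths to text), and the helper names `normalize`, `find_topic`, and `generate_labels` are all hypothetical, and exact string matching stands in for the paper's alignment and structural heuristics.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical toy knowledge base; a real one would hold millions of triples.
KB: List[Triple] = [
    ("Do the Right Thing", "directed_by", "Spike Lee"),
    ("Do the Right Thing", "release_year", "1989"),
    ("Malcolm X", "directed_by", "Spike Lee"),
]

# Hypothetical page: XPath-like node path -> text content of that DOM node.
PAGE: Dict[str, str] = {
    "/html/body/h1": "Do the Right Thing",
    "/html/body/div[1]/span[2]": "Spike Lee",
    "/html/body/div[2]/span[2]": "1989",
    "/html/body/div[3]/span[2]": "Danny Aiello",
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def find_topic(page: Dict[str, str], kb: List[Triple]) -> Optional[str]:
    """Pick the KB subject that appears on the page with the most matching objects."""
    texts = {normalize(t) for t in page.values()}
    scores: Dict[str, int] = {}
    for subj, _pred, obj in kb:
        if normalize(subj) in texts and normalize(obj) in texts:
            scores[subj] = scores.get(subj, 0) + 1
    return max(scores, key=lambda s: scores[s]) if scores else None


def generate_labels(page: Dict[str, str], kb: List[Triple]) -> Dict[str, str]:
    """Label each DOM node whose text equals an object of the page's topic entity."""
    topic = find_topic(page, kb)
    labels: Dict[str, str] = {}
    for path, text in page.items():
        for subj, pred, obj in kb:
            if subj == topic and normalize(obj) == normalize(text):
                labels[path] = pred
    return labels


if __name__ == "__main__":
    print(find_topic(PAGE, KB))
    print(generate_labels(PAGE, KB))
```

In this sketch the node containing "Danny Aiello" receives no label because the toy knowledge base has no fact about it. This is the kind of incompleteness that a classifier trained on the generated labels is meant to overcome when predicting new relation instances.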
