Exploiting Collective Hidden Structures in Webpage Titles for Open Domain Entity Extraction

We present a novel method for open domain named entity extraction that exploits the collective hidden structures in webpage titles. Our method uncovers the hidden textual structures shared by sets of webpage titles using generalized URL patterns and a multiple sequence alignment technique. It has two main highlights: 1) The boundaries of entities are identified automatically and collectively, without any manually designed pattern, seed, or class name. 2) The connections between entities are also discovered naturally from the hidden structures, which makes it easy to incorporate distant or weak supervision. Experiments show that our method can harvest open domain entities at large scale with high precision. A large proportion of the extracted entities are long-tailed and complex, and they cover diverse topics. Given the extracted entities and their connections, we further show the effectiveness of our method in a weakly supervised setting, where it produces domain-specific entities with better precision and recall than state-of-the-art approaches.
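
To make the idea of a shared title structure concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical URL generalization rule (digit runs replaced by a placeholder) and substitutes a much cruder technique for multiple sequence alignment: the longest common token prefix and suffix of the titles in each URL group. The function names, the example URLs, and the example titles are all invented for illustration.

```python
import re
from collections import defaultdict


def generalize_url(url):
    """Replace every digit run with a placeholder (an assumed, simplistic rule)."""
    return re.sub(r"\d+", "<NUM>", url)


def common_prefix_len(token_lists):
    """Count the leading tokens that are identical across all token lists."""
    n = 0
    for tokens in zip(*token_lists):
        if len(set(tokens)) != 1:
            break
        n += 1
    return n


def extract_candidates(pages):
    """Group titles by generalized URL and keep the varying middle span of each title.

    Common prefix/suffix stripping stands in for multiple sequence alignment here;
    a real alignment could also handle templates with several variable slots.
    """
    groups = defaultdict(list)
    for url, title in pages:
        groups[generalize_url(url)].append(title.split())

    result = {}
    for pattern, titles in groups.items():
        if len(titles) < 2:
            continue  # a single title gives no collective evidence
        p = common_prefix_len(titles)
        # Measure the shared suffix on the remainder so prefix and suffix never overlap.
        s = common_prefix_len([t[p:][::-1] for t in titles])
        spans = [" ".join(t[p:len(t) - s]) for t in titles]
        result[pattern] = [span for span in spans if span]
    return result


# Hypothetical input: (URL, title) pairs from one invented site.
pages = [
    ("http://example.com/movie/123", "The Godfather - Example Movies"),
    ("http://example.com/movie/456", "Spirited Away - Example Movies"),
    ("http://example.com/movie/789", "Blade Runner - Example Movies"),
]
candidates = extract_candidates(pages)
```

In this toy input all three URLs generalize to the same pattern, the tokens "- Example Movies" are shared by every title, and the remaining span of each title is kept as an entity candidate. No pattern, seed, or class name is supplied by hand, which is the property the abstract emphasizes.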
