Bootstrapping Large-scale Named Entities using URL-Text Hybrid Patterns

Automatically mining named entities (NEs) is an important but challenging task, and pattern-based bootstrapping is the most widely accepted solution. In this paper, we propose a novel method for NE mining using web document titles. In addition to traditional text patterns, we propose URL-text hybrid patterns, which add a URL criterion to better pinpoint high-quality NEs. We also design a multiclass collaborative learning mechanism for bootstrapping, in which different patterns and different classes work together to determine better patterns and NE instances. Experimental results show that the precision of NEs mined with the proposed method is 0.96 and 0.94 on Chinese and English corpora, respectively. A comparison also shows that the proposed method significantly outperforms a representative method that mines NEs from large-scale query logs.
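To make the idea of a URL-text hybrid pattern concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hybrid pattern can be modeled as a pair of regular expressions, one applied to the page URL and one to the page title, and that a candidate NE is accepted only when both match. The class name `HybridPattern`, the `example.com` URLs, and the title template are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HybridPattern:
    """Hypothetical URL-text hybrid pattern: both parts must match."""
    url_regex: str    # assumed criterion on the page URL
    title_regex: str  # assumed text pattern on the title; group 'ne' holds the candidate

    def extract(self, url: str, title: str) -> Optional[str]:
        # Reject the page outright if its URL does not satisfy the URL criterion.
        if not re.search(self.url_regex, url):
            return None
        # Otherwise apply the text pattern to the title and return the captured span.
        match = re.search(self.title_regex, title)
        return match.group("ne").strip() if match else None


# Illustrative pattern for a made-up movie site (not taken from the paper).
pattern = HybridPattern(
    url_regex=r"^https?://movies\.example\.com/film/\d+$",
    title_regex=r"^(?P<ne>.+?) - Watch Online$",
)

pages = [
    ("http://movies.example.com/film/123", "Inception - Watch Online"),
    ("http://movies.example.com/help", "FAQ - Watch Online"),   # title matches, URL does not
    ("http://movies.example.com/film/456", "Contact us"),       # URL matches, title does not
]

candidates = [ne for url, title in pages if (ne := pattern.extract(url, title))]
print(candidates)  # ['Inception']
```

In this toy example the text pattern alone would also accept "FAQ", which is not a movie name; the URL criterion filters it out. This is the kind of precision gain the abstract attributes to hybrid patterns, though how the paper actually represents and scores patterns is not specified here.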
