An Approach to Web-Scale Named-Entity Disambiguation

We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster, and hence disambiguate, entities, and is designed to handle NED on the entire web. We show that on web collections NED becomes increasingly difficult as corpus size increases, not only because the NED algorithm is harder to scale, but also because new and surprising facets of entities become visible in the data. This effect limits the benefit that data-driven approaches can gain from processing larger datasets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.
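
To make the idea of clustering by name co-occurrence concrete, the following is a minimal illustrative sketch and not the method of the paper: it groups mentions of one ambiguous name in a single greedy pass, whereas the approach described above is multi-pass and built for web scale. The function names, the Jaccard similarity measure, the threshold value, and the toy documents are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of co-occurring names."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_mentions(mentions, threshold=0.2):
    """Greedy single-pass clustering of mentions of one ambiguous name.

    mentions: list of (doc_id, set of names co-occurring with the mention).
    Returns a list of clusters, each a dict with 'docs' and 'names'.
    The threshold is a hypothetical value, not taken from the paper.
    """
    clusters = []
    for doc_id, names in mentions:
        # Find the existing cluster whose name set overlaps most.
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(names, cluster["names"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Similar enough: treat as the same entity and merge contexts.
            best["docs"].append(doc_id)
            best["names"] |= names
        else:
            # Otherwise start a new cluster, i.e. a new candidate entity.
            clusters.append({"docs": [doc_id], "names": set(names)})
    return clusters


# Toy data (invented): four documents mentioning "Michael Jordan".
mentions = [
    ("d1", {"Chicago Bulls", "Scottie Pippen", "NBA"}),
    ("d2", {"Berkeley", "David Blei", "Yee Whye Teh"}),
    ("d3", {"NBA", "Chicago Bulls", "Phil Jackson"}),
    ("d4", {"David Blei", "Andrew Ng", "Berkeley"}),
]
clusters = cluster_mentions(mentions)
for cluster in clusters:
    print(cluster["docs"])
```

In this toy case the two basketball documents and the two machine-learning documents fall into separate clusters. The sketch also hints at the difficulty noted above: as more documents arrive, a cluster's name set grows and new facets of an entity may share few names with the contexts seen so far.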
