A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection

Entity linking maps name mentions in context to entries in a knowledge base through resolving the name variations and ambiguities. In this paper, we propose two advancements for entity linking. First, a Wikipedia-LDA method is proposed to model the contexts as the probability distributions over Wikipedia categories, which allows the context similarity being measured in a semantic space instead of literal term space used by other studies for the disambiguation. Furthermore, to automate the training instance annotation without compromising the accuracy, an instance selection strategy is proposed to select an informative, representative and diverse subset from an auto-generated dataset. During the iterative selection process, the batch sizes at each iteration change according to the variance of classifier’s confidence or accuracy between batches in sequence, which not only makes the selection insensitive to the initial batch size, but also leads to a better performance. The above two advancements give significant improvements to entity linking individually. Collectively they lead the highest performance on KBP-10 task. Being a generic approach, the batch size changing method can also benefit active learning for other tasks.

[1]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[2]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[3]  Breck Baldwin,et al.  Entity-Based Cross-Document Coreferencing Using the Vector Space Model , 1998, COLING.

[4]  Sharon A. Caraballo Automatic construction of a hypernym-labeled noun hierarchy from text , 1999, ACL.

[5]  Ralf Herbrich,et al.  Large margin rank boundaries for ordinal regression , 2000 .

[6]  Thore Graepel,et al.  Large Margin Rank Boundaries for Ordinal Regression , 2000 .

[7]  Dan Klein,et al.  Fast Exact Inference with a Factored Model for Natural Language Parsing , 2002, NIPS.

[8]  Klaus Brinker,et al.  Incorporating Diversity in Active Learning with Support Vector Machines , 2003, ICML.

[9]  Huan Liu,et al.  On Issues of Instance Selection , 2002, Data Mining and Knowledge Discovery.

[10]  Chris Mellish,et al.  Advances in Instance Selection for Instance-Based Learning Algorithms , 2002, Data Mining and Knowledge Discovery.

[11]  Jian Su,et al.  Multi-Criteria-based Active Learning for Named Entity Recognition , 2004, ACL.

[12]  Thorsten Joachims,et al.  Training linear SVMs in linear time , 2006, KDD '06.

[13]  Simone Paolo Ponzetto,et al.  Deriving a Large-Scale Taxonomy from Wikipedia , 2007, AAAI.

[14]  Michael Strube,et al.  Distinguishing between Instances and Classes in the Wikipedia Taxonomy , 2008, ESWC.

[15]  Carlotta Domeniconi,et al.  Building semantic kernels for text classification using wikipedia , 2008, KDD.

[16]  Andreas Vlachos,et al.  A stopping criterion for active learning , 2008, Comput. Speech Lang..

[17]  Vasudeva Varma,et al.  IIIT Hyderabad at TAC 2009 , 2008, TAC.

[18]  David Yarowsky,et al.  HLTCOE Approaches to Knowledge Base Population at TAC 2009 , 2009, TAC.

[19]  Xianpei Han,et al.  Named entity disambiguation by leveraging wikipedia semantic knowledge , 2009, CIKM.

[20]  Vasudeva Varma,et al.  IIIT Hyderabad at Million Query Track TREC 2009 , 2009, TREC.

[21]  Yang Tang,et al.  THU QUANTA at TAC 2009 KBP and RTE Track , 2009, TAC.

[22]  Ramesh Nallapati,et al.  Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora , 2009, EMNLP.

[23]  Heng Ji,et al.  Overview of the TAC 2010 Knowledge Base Population Track , 2010 .

[24]  Jian Su,et al.  Entity Linking Leveraging Automatically Generated Annotation , 2010, COLING.

[25]  Mark Dredze,et al.  Entity Disambiguation for Knowledge Base Population , 2010, COLING.