A Two-Stage Joint Model for Domain-Specific Entity Detection and Linking Leveraging an Unlabeled Corpus

The intensive construction of domain-specific knowledge bases (DSKB) has posed an urgent demand for researches about domain-specific entity detection and linking (DSEDL). Joint models are usually adopted in DSEDL tasks, but data imbalance and high computational complexity exist in these models. Besides, traditional feature representation methods are insufficient for domain-specific tasks, due to problems such as lack of labeled data, link sparseness in DSKBs, and so on. In this paper, a two-stage joint (TSJ) model is proposed to solve the data imbalance problem by discriminatively processing entity mentions with different degrees of ambiguity. In addition, three novel methods are put forward to generate effective features by incorporating an unlabeled corpus. One crucial feature involving entity detection is the mention type, extracted by a long short-term memory (LSTM) model trained on automatically annotated data. The other two types of features mainly involve entity linking, including the inner-document topical coherence, which is measured based on entity co-occurring relationships in the corpus, and the cross-document entity coherence evaluated using similar documents. An overall 74.26% F1 value is obtained on a dataset of real-world movie comments, demonstrating the effectiveness of the proposed approach and indicating its potentiality to be used in real-world domain-specific applications.

[1]  Gerhard Paass,et al.  From names to entities using thematic context distance , 2011, CIKM '11.

[2]  Juan-Zi Li,et al.  Domain-Specific Entity Linking via Fake Named Entity Detection , 2016, DASFAA.

[3]  Vasudeva Varma,et al.  IIIT Hyderabad at TAC 2009 , 2008, TAC.

[4]  Xianpei Han,et al.  A Generative Entity-Mention Model for Linking Entities with Knowledge Base , 2011, ACL.

[5]  Ronald J. Williams,et al.  A Learning Algorithm for Continually Running Fully Recurrent Neural Networks , 1989, Neural Computation.

[6]  Jian Su,et al.  Named Entity Recognition using an HMM-based Chunk Tagger , 2002, ACL.

[7]  Gerhard Weikum,et al.  Robust Disambiguation of Named Entities in Text , 2011, EMNLP.

[8]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[9]  Juan-Zi Li,et al.  Graph-Based Jointly Modeling Entity Detection and Linking in Domain-Specific Area , 2016, CCKS.

[10]  Qingcai Chen,et al.  ICRC-DSEDL: A Film Named Entity Discovery and Linking System Based on Knowledge Bases , 2016, CCKS.

[11]  Razvan C. Bunescu,et al.  Using Encyclopedic Knowledge for Named entity Disambiguation , 2006, EACL.

[12]  Ting Liu,et al.  Microblog Entity Linking by Leveraging Extra Posts , 2013, EMNLP.

[13]  Xianpei Han,et al.  An Entity-Topic Model for Entity Linking , 2012, EMNLP.

[14]  Yang Li,et al.  Mining evidences for named entity disambiguation , 2013, KDD.

[15]  Jun Zhao,et al.  Collective entity linking in web text: a graph-based method , 2011, SIGIR.

[16]  Heng Ji,et al.  Collaborative Ranking: A Case Study on Entity Linking , 2011, EMNLP.

[17]  Avirup Sil,et al.  Re-ranking for joint named-entity recognition and linking , 2013, CIKM.

[18]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[19]  Mitul Tiwari,et al.  Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach , 2013, Proc. VLDB Endow..

[20]  Jian Su,et al.  Entity Linking Leveraging Automatically Generated Annotation , 2010, COLING.

[21]  Xiao Li,et al.  Domain-Specific Entity Discovery and Linking Task , 2016, CCKS.

[22]  Jian Su,et al.  Entity Linking with Effective Acronym Expansion, Instance Selection, and Topic Modeling , 2011, IJCAI.

[23]  Ian H. Witten,et al.  Learning to link with wikipedia , 2008, CIKM '08.

[24]  Jing Jiang,et al.  Linking Entities to a Knowledge Base with Query Expansion , 2011, EMNLP.

[25]  Yitong Li,et al.  Entity Linking for Tweets , 2013, ACL.

[26]  Jian Su,et al.  NUS-I2R: Learning a Combined System for Entity Linking , 2010, TAC.

[27]  Yang Li,et al.  Entity Disambiguation with Linkless Knowledge Bases , 2016, WWW.

[28]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[29]  Hermann Ney,et al.  Maximum Entropy Models for Named Entity Recognition , 2003, CoNLL.

[30]  Doug Downey,et al.  Local and Global Algorithms for Disambiguation to Wikipedia , 2011, ACL.

[31]  Jiawei Han,et al.  Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions , 2015, IEEE Transactions on Knowledge and Data Engineering.

[32]  Wei Shen,et al.  Linking named entities in Tweets with knowledge base via user interest modeling , 2013, KDD.

[33]  Ming-Wei Chang,et al.  To Link or Not to Link? A Study on End-to-End Tweet Entity Linking , 2013, NAACL.

[34]  Gianluca Demartini,et al.  ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking , 2012, WWW.

[35]  James Hammerton,et al.  Named Entity Recognition with Long Short-Term Memory , 2003, CoNLL.

[36]  Wei Shen,et al.  LINDEN: linking named entities with knowledge base via semantic knowledge , 2012, WWW.

[37]  Ian H. Witten,et al.  An open-source toolkit for mining Wikipedia , 2013, Artif. Intell..

[38]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[39]  Michael Granitzer,et al.  Robust and Collective Entity Disambiguation through Semantic Embeddings , 2016, SIGIR.