WN-Salience: A Corpus of News Articles with Entity Salience Annotations

Entities can be found in various text genres, ranging from tweets and web pages to user queries submitted to web search engines. Existing research either treats all entities in a text as equally important or relies on heuristics to measure their salience. We believe that a key reason for the relatively limited work on entity salience is the lack of appropriate datasets. To support research on entity salience, we present a new dataset, the WikiNews Salience dataset (WN-Salience), which can be used to benchmark tasks such as entity salience detection and salient entity linking. WN-Salience is built on top of Wikinews, a Wikimedia project whose mission is to present reliable news articles. Entities in Wikinews articles are identified by the authors of the articles and are linked to Wikinews categories when they are salient and to Wikipedia pages otherwise. The dataset is built automatically and consists of approximately 7,000 news articles and 90,000 in-text entity annotations. We compare WN-Salience against existing datasets for the task and analyze their differences. Furthermore, we conduct experiments on entity salience detection; the results demonstrate that WN-Salience is a challenging testbed that is complementary to existing ones.
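The labeling rule described above (a link to a Wikinews category marks a salient entity, a link to a Wikipedia page marks a non-salient one) can be illustrated with a minimal sketch. It assumes that entity links have already been extracted as (mention, target URL) pairs; the function name `label_salience`, the URL patterns it checks, and the sample annotations are hypothetical and are not taken from the dataset's actual construction pipeline.

```python
from urllib.parse import urlparse, unquote


def label_salience(target_url):
    """Return True (salient), False (non-salient), or None (not an entity link).

    Assumption: salient entities point at a Wikinews category page and
    non-salient entities point at a Wikipedia page.
    """
    parsed = urlparse(target_url)
    host = parsed.netloc.lower()
    path = unquote(parsed.path)  # decode e.g. "Category%3A" to "Category:"
    if host.endswith("wikinews.org") and "/Category:" in path:
        return True
    if host.endswith("wikipedia.org"):
        return False
    return None


# Hypothetical in-text annotations from one article.
annotations = [
    ("Barack Obama", "https://en.wikinews.org/wiki/Category:Barack_Obama"),
    ("Chicago", "https://en.wikipedia.org/wiki/Chicago"),
    ("source", "https://example.com/story"),
]

labels = [(mention, label_salience(url)) for mention, url in annotations]
```

Under these assumptions, `labels` pairs each mention with a binary salience label, and links that point elsewhere are left unlabeled (`None`) rather than forced into either class.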
