Reddit Entity Linking Dataset

We introduce and make publicly available an entity linking dataset from Reddit that contains 17,316 linked entities, each annotated by three human annotators and then grouped into Gold, Silver, and Bronze tiers to indicate inter-annotator agreement. We analyze the errors and disagreements among annotators and suggest three types of corrections to the raw data. Finally, we test existing entity linking models that were trained and tuned on text from non-social-media datasets. We find that, although these models perform very well on their original datasets, they perform poorly on this social media dataset. We also show that the majority of these errors can be attributed to poor performance on the mention detection subtask. These results indicate the need for better entity linking models that can be applied to the enormous amount of social media text.
