Entity models for trigger-reaction documents

We define the notion of an entity model for a special kind of document popular on the web: an article followed by a list of reactions to that article, usually by many authors and usually in reverse chronological order. We call these documents trigger-reaction pairs. The entity model describes which named entities (persons, organizations, locations, products, URLs) are mentioned, their type, and how often and where they are mentioned, and it lists all variants referring to the same entity. These models find applications in media analysis, trend watching, entity tracking and marketing. The two main challenges in creating entity models are 1) detecting the entities and 2) normalizing all variants to the same correct canonical form. This task is particularly hard for user-generated content on the web, of which our reactions are an example. We use an algorithm for named entity recognition and normalization (NEN) tailor-made for trigger-reaction documents. It achieves high recall and reasonable precision by exploiting two simple facts: 1) entities that appear incomplete in reactions often occur in complete form in the trigger and 2) entities mentioned in news articles on the web often have a Wikipedia page. This article describes our experience in creating and using entity models on a corpus of 56,449 Dutch trigger-reaction documents, with a total of 616,715 reactions, collected from the web from November 11, 2006 to February 5, 2008. This paper accompanies an earlier article from our group that focused on a system evaluation of the NEN algorithm.
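
To make the notion concrete, the following is a minimal sketch of what an entity model for one trigger-reaction pair might look like as a data structure, together with a toy normalization step that uses the two facts mentioned above. It is not the NEN algorithm itself: the names `EntityEntry`, `normalize` and `build_entity_model`, the `wiki_titles` lookup table (a stand-in for Wikipedia page titles), the token-subset matching rule, and the order in which the two facts are applied are all assumptions made for illustration. Entity detection is assumed to have happened already, so the input is a list of (surface form, type) pairs.

```python
from dataclasses import dataclass, field


@dataclass
class EntityEntry:
    """One entity in the model: canonical form, type, variants, mention locations."""
    canonical: str
    entity_type: str
    variants: set = field(default_factory=set)
    mentions: list = field(default_factory=list)  # e.g. "trigger", "reaction 2"


def normalize(surface, trigger_entities, wiki_titles):
    """Map a surface form to a canonical form (illustrative heuristic only)."""
    key = surface.lower()
    # Fact 2 (assumed lookup): a known variant of a Wikipedia page title.
    if key in wiki_titles:
        return wiki_titles[key]
    # Fact 1: an incomplete mention often occurs complete in the trigger.
    # Assumed rule: all tokens of the mention appear in a trigger entity.
    tokens = set(key.split())
    matches = [e for e in trigger_entities if tokens <= set(e.lower().split())]
    if matches:
        longest = max(len(m.split()) for m in matches)
        best = {m for m in matches if len(m.split()) == longest}
        if len(best) == 1:  # only accept an unambiguous most-complete form
            return best.pop()
    return surface


def build_entity_model(trigger, reactions, wiki_titles=None):
    """trigger: list of (surface, type); reactions: list of such lists."""
    wiki_titles = wiki_titles or {}
    trigger_entities = [surface for surface, _ in trigger]
    model = {}

    def record(surface, entity_type, where):
        canonical = normalize(surface, trigger_entities, wiki_titles)
        entry = model.setdefault(canonical, EntityEntry(canonical, entity_type))
        entry.variants.add(surface)
        entry.mentions.append(where)

    for surface, entity_type in trigger:
        record(surface, entity_type, "trigger")
    for index, mentions in enumerate(reactions, start=1):
        for surface, entity_type in mentions:
            record(surface, entity_type, f"reaction {index}")
    return model
```

In this sketch, the number of mentions of an entity is simply `len(entry.mentions)`, and `entry.variants` lists every surface form that was mapped to the same canonical form. A real system would need a far more careful matching rule than token containment, which is used here only to keep the example short.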
