Paper information - Entity models for trigger-reaction documents


We define the notion of an entity model for a special kind of document popular on the web: an article followed by a list of reactions to that article, usually by many authors and usually in reverse chronological order. We call these documents trigger-reaction pairs. The entity model describes which named entities (persons, organizations, locations, products, URLs) are mentioned, their type, and how often and where they are mentioned, and it lists all variants referring to the same entity. These models find applications in media analysis, trend watching, entity tracking, and marketing. The two main challenges in creating entity models are 1) detecting the entities and 2) normalizing all variants to the same correct canonical form. This task is particularly hard for user-generated content on the web, of which our reactions are an example. We use an algorithm for named entity recognition and normalization (NEN) tailor-made for trigger-reaction documents. It achieves high recall and reasonable precision by exploiting two simple facts: 1) entities that appear incomplete in reactions often occur in complete form in the trigger, and 2) entities mentioned in news articles on the web often have a Wikipedia page. This article describes our experience in creating and using entity models on a corpus of 56,449 Dutch trigger-reaction documents, with a total of 616,715 reactions, collected from the web from November 11, 2006 to February 5, 2008. This paper accompanies an earlier article from our group that focused on a systems evaluation of the NEN algorithm.
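To make the idea of an entity model more concrete, the following is a minimal Python sketch of the first fact only: an incomplete mention in a reaction is mapped to a complete entity from the trigger when its tokens form a subset of that entity's tokens. This is not the authors' NEN algorithm. The names `EntityEntry`, `normalize`, and `build_entity_model`, the token-subset matching rule, and the sample names are all assumptions made for illustration, and the Wikipedia-based step is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class EntityEntry:
    """One row of a toy entity model (hypothetical structure)."""
    canonical: str
    entity_type: str
    variants: set = field(default_factory=set)
    # (location, index) pairs; location is "trigger" or "reaction"
    mentions: list = field(default_factory=list)


def normalize(mention, trigger_entities):
    """Map a possibly incomplete mention to a complete trigger entity.

    Returns the single trigger entity whose tokens contain every token of
    the mention (case-insensitive), or None if there is no match or the
    match is ambiguous. This matching rule is an assumption.
    """
    tokens = set(mention.lower().split())
    candidates = [
        name for name in trigger_entities
        if tokens <= set(name.lower().split())
    ]
    return candidates[0] if len(candidates) == 1 else None


def build_entity_model(trigger_entities, reaction_mentions):
    """Build a toy entity model.

    trigger_entities: dict mapping complete entity name -> entity type.
    reaction_mentions: list of (reaction_index, mention_text) pairs.
    """
    model = {
        name: EntityEntry(name, etype, {name}, [("trigger", 0)])
        for name, etype in trigger_entities.items()
    }
    for idx, mention in reaction_mentions:
        canonical = normalize(mention, trigger_entities)
        if canonical is None:
            # A fuller system might fall back to an external resource
            # such as Wikipedia titles here; this sketch skips the mention.
            continue
        entry = model[canonical]
        entry.variants.add(mention)
        entry.mentions.append(("reaction", idx))
    return model


# Illustrative input, not taken from the corpus described above.
trigger = {"Jan Peter Balkenende": "person", "Tweede Kamer": "organization"}
reactions = [(0, "Balkenende"), (1, "balkenende"), (2, "Kamer"), (3, "Wilders")]
model = build_entity_model(trigger, reactions)
```

In this sketch the unmatched mention in reaction 3 is simply dropped, which trades recall for simplicity; the abstract suggests that the real algorithm recovers such cases through Wikipedia.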

Khalid | Maarten Marx | Marc X. Makkes
