An Exploration of Entity Models, Collective Classification and Relation Description

Traditional information retrieval typically represents data using a bag of words; data mining typically uses a highly structured database representation. This paper explores the middle ground using a representation which we term entity models, in which questions about structured data may be posed and answered, but the complexities and task-specific restrictions of ontologies are avoided. An entity model is a language model or word distribution associated with an entity, such as a person, place, or organization. Using these per-entity language models, entities may be clustered, links may be detected or described with a short summary, entities may be collectively classified, and question answering may be performed. On a corpus of entities extracted from newswire and the Web, we group entities by profession with 90% accuracy, improve accuracy further on the task of classifying politicians as liberal or conservative using collective classification and conditional random fields, and answer questions about "who a person is" with a mean reciprocal rank (MRR) of 0.52.
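The two building blocks of the abstract can be made concrete with a short sketch: an entity model as a unigram word distribution over an entity's context words, and the mean reciprocal rank metric used for the question-answering evaluation. This is a minimal illustration, not the paper's implementation; the function names and the whitespace tokenization are assumptions for the example.

```python
from collections import Counter

def entity_model(contexts):
    """Build a unigram language model for an entity: a word distribution
    estimated from the text contexts in which the entity is mentioned.
    Tokenization here is simple whitespace splitting (an assumption)."""
    counts = Counter(word for sentence in contexts for word in sentence.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def mean_reciprocal_rank(ranked_lists, gold_answers):
    """MRR: for each query, take 1/rank of the first correct answer in the
    ranked candidate list (0 if absent), then average over all queries."""
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_answers):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

With such per-entity distributions in hand, clustering or centroid-based classification by profession reduces to comparing word distributions (e.g., with KL divergence or cosine similarity), and the MRR of 0.52 reported in the abstract corresponds to the correct answer appearing, on average, near rank 2.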
