Entity categorization over large document collections

Extracting entities (such as people or movies) from documents and identifying the categories they belong to (such as painter or writer) enable structured querying and data analysis over unstructured document collections. In this paper, we focus on the problem of categorizing extracted entities. Most prior approaches to this task analyze only the local document context in which an entity occurs. We significantly improve the accuracy of entity categorization by (i) considering an entity's context across all documents that contain it, and (ii) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of each entity has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address both challenges, and we present a thorough experimental study on real data sets that demonstrates the increase in accuracy and the scalability of our approaches.
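
To make the two ideas concrete, the following is a minimal illustrative sketch, not the method developed in the paper. It assumes whitespace tokenization, exact string matching of entity mentions, and a fixed context window; the function names `aggregate_contexts` and `list_membership_features`, the `window` parameter, and the sample data are all hypothetical. It shows (i) pooling the words surrounding an entity over every document that mentions it and (ii) deriving features from membership in existing entity lists.

```python
from collections import Counter, defaultdict


def aggregate_contexts(documents, entities, window=3):
    """Pool context-word counts for each entity across all documents.

    Assumption: mentions are found by exact, case-insensitive token match.
    """
    contexts = defaultdict(Counter)
    for text in documents:
        tokens = text.lower().split()
        for entity in entities:
            ent_tokens = entity.lower().split()
            n = len(ent_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == ent_tokens:
                    # Words immediately before and after the mention.
                    left = tokens[max(0, i - window):i]
                    right = tokens[i + n:i + n + window]
                    contexts[entity].update(left + right)
    return contexts


def list_membership_features(entity, entity_lists):
    """One boolean feature per list: does the list contain the entity?"""
    key = entity.lower()
    return {name: key in members for name, members in entity_lists.items()}


# Hypothetical sample data, for illustration only.
documents = [
    "Pablo Picasso painted Guernica in 1937",
    "The painter Pablo Picasso was born in Malaga",
]
entity_lists = {
    "painters": {"pablo picasso"},
    "directors": {"orson welles"},
}

contexts = aggregate_contexts(documents, ["Pablo Picasso"])
features = list_membership_features("Pablo Picasso", entity_lists)
```

In this sketch the pooled counts and the list-membership flags would both serve as inputs to a downstream classifier. Storing each list as an in-memory set is only reasonable for small lists; it does not address the scalability concerns raised above for very large lists.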
