Optimization issues in inverted index-based entity annotation

Entity annotation is emerging as a key enabling requirement for search based on deeper semantics: for example, a search on 'John's address' that returns matches to all entities annotated as an address that co-occur with 'John'. A dominant paradigm adopted by rule-based named entity annotators is to annotate one document at a time. The cost of this approach grows linearly with the number of documents and with the cost of annotating each document, which can be prohibitive for large document corpora. A recently proposed alternative paradigm for rule-based entity annotation [16] operates on the inverted index of a document collection and achieves an order-of-magnitude speed-up over its document-based counterpart. In addition, the index-based approach permits collection-level optimization of the order of index operations required for the annotation process. It is this aspect that is explored in this paper. We develop a polynomial-time algorithm that, based on estimated cost, can optimally select between different logically equivalent evaluation plans for a given rule. Additionally, we prove that this problem becomes NP-hard when the optimization has to be performed over multiple rules, and we provide effective heuristics for handling this case. Our empirical evaluations show a speed-up of up to a factor of 2 over the baseline system without optimizations.
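To make the idea of choosing between logically equivalent evaluation plans concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes a rule is a sequence of terms that must appear adjacently, that each join of two posting lists costs the sum of its input sizes, and that a join's output is estimated as the smaller of its inputs. The names build_index, adjacent_join, best_plan, and execute are hypothetical and were introduced only for this illustration. Under these assumptions, a dynamic program in the style of matrix-chain ordering picks the cheapest join order in polynomial time.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].append((doc_id, pos))
    return index


def spans(index, term):
    """Turn postings into (doc_id, start, end) spans of length one."""
    return [(d, p, p) for d, p in index.get(term, [])]


def adjacent_join(left, right):
    """Merge each left span with every right span that starts right after it."""
    ends_by_start = defaultdict(list)
    for doc, start, end in right:
        ends_by_start[(doc, start)].append(end)
    return [(doc, s, e2)
            for doc, s, e in left
            for e2 in ends_by_start.get((doc, e + 1), [])]


def best_plan(sizes):
    """Cheapest join order for a term sequence, by interval DP.

    ASSUMED cost model (not from the paper): a join costs the sum of its
    input sizes, and its output size is estimated as min(input sizes).
    Returns (estimated_cost, plan); a plan is a term index or a pair of plans.
    """
    n = len(sizes)
    best = {(i, i): (0, sizes[i], i) for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            candidates = []
            for k in range(i, j):
                l_cost, l_size, l_plan = best[(i, k)]
                r_cost, r_size, r_plan = best[(k + 1, j)]
                cost = l_cost + r_cost + l_size + r_size
                candidates.append((cost, min(l_size, r_size), (l_plan, r_plan)))
            best[(i, j)] = min(candidates, key=lambda c: c[0])
    cost, _, plan = best[(0, n - 1)]
    return cost, plan


def execute(plan, terms, index):
    """Evaluate a plan against the index, returning matching spans."""
    if isinstance(plan, int):
        return spans(index, terms[plan])
    left, right = plan
    return adjacent_join(execute(left, terms, index),
                         execute(right, terms, index))


docs = ["john smith lives at 12 main street",
        "mary smith lives in paris",
        "john smith lives in rome"]
terms = ["john", "smith", "lives"]
index = build_index(docs)
cost, plan = best_plan([len(index[t]) for t in terms])
matches = execute(plan, terms, index)
```

Every plan for the same rule returns the same spans; only the estimated work differs, which is why the choice can be made purely on cost. A real system would use better selectivity estimates than the min-of-inputs assumption used here.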

[1] Hugh E. Williams, et al. What's Next? Index Structures for Efficient Phrase Querying, 1999, Australasian Database Conference.

[2] Douglas E. Appelt, et al. SRI International FASTUS System: MUC-6 Test Results and Analysis, 1995, MUC.

[3] Eser Kandogan, et al. Avatar semantic search: a database approach to information retrieval, 2006, SIGMOD Conference.

[4] Howard R. Turtle, et al. Query Evaluation: Strategies and Optimizations, 1995, Inf. Process. Manag.

[5] Gaston H. Gonnet, et al. Fast text searching for regular expressions or automaton searching on tries, 1996, JACM.

[6] Prasan Roy, et al. Efficient and extensible algorithms for multi query optimization, 1999, SIGMOD '00.

[7] Abraham Silberschatz, et al. Database Systems Concepts, 1997.

[8] Soumen Chakrabarti, et al. Optimizing scoring functions and indexes for proximity search in type-annotated corpora, 2006, WWW '06.

[9] Ian H. Witten, et al. Managing Gigabytes: Compressing and Indexing Documents and Images, 1999.

[10] Shivakumar Venkataraman, et al. Cost-based optimization of decision support queries using transient-views, 1998, SIGMOD '98.

[11] Ganesh Ramakrishnan, et al. Entity Annotation based on Inverse Index Operations, 2006, EMNLP.

[12] Andrei Z. Broder, et al. Efficient query evaluation using a two-level retrieval process, 2003, CIKM '03.

[13] Yorick Wilks, et al. Named Entity Recognition from Diverse Text Types, 2001.

[14] Ralph Grishman, et al. Information Extraction: Techniques and Challenges, 1997, SCIE.

[15] Abhi Shelat, et al. Approximating the smallest grammar: Kolmogorov complexity in natural models, 2002, STOC '02.

[16] Ian H. Witten, et al. Managing gigabytes (2nd ed.): compressing and indexing documents and images, 1999.

[17] Diana Maynard, et al. JAPE: a Java Annotation Patterns Engine, 2000.

[18] Ramanathan V. Guha, et al. SemTag and seeker: bootstrapping the semantic web via automated semantic annotation, 2003, WWW '03.

[19] Hugh E. Williams, et al. Efficient phrase querying with an auxiliary index, 2002, SIGIR '02.

[20] W. Bruce Croft, et al. Optimization strategies for complex queries, 2005, SIGIR '05.