Learning sufficient queries for entity filtering

Entity-centric document filtering is the task of analyzing a time-ordered stream of documents and emitting those that are relevant to a specified set of entities (e.g., people, places, organizations). This task is exemplified by the TREC Knowledge Base Acceleration (KBA) track and has broad applicability in other modern IR settings. In this paper, we present a simple yet effective approach based on learning high-quality Boolean queries that can be applied deterministically during filtering. We call these Boolean statements sufficient queries. We argue that using deterministic queries for entity-centric filtering can reduce confounding factors seen in more familiar "score-then-threshold" filtering methods. Experiments on two standard datasets show significant improvements over state-of-the-art baseline models.