Template Sampling for Leveraging Domain Knowledge in Information Extraction

We initially describe a feature-rich discriminative Conditional Random Field ( CRF) model for Information Extraction in the workshop announcements domain, which offers good baseline performance in the PASCAL shared task. We then propose a method for leveraging domain knowledge in Information Extraction tasks, scoring candidate document labellings as one-value-per-field templates according to domain feasibility after generating sample labellings from a trained sequence classifier. Our relational models evaluate these templates according to our intuitions about agreement in the domain: workshop acronyms should resemble their names, workshop dates occur after paper submission dates. These methods see a 5% f-score improvement in fields retrieved when sampling labellings from a Maximum-Entropy Markov Model, however we do not observe improvement over a CRF model. We discuss reasons for this, including the problem of recovering all field instances from a best template, and propose future work in adapting such a model to the CRF, a better standalone system.