The University of Sheffield System at TAC KBP 2010

This paper describes the University of Sheffield’s entry in the 2010 TAC KBP entity linking and slot filling tasks. This was our first participation in the TAC KBP track. Given limited human resources and a relatively late decision to participate, we chose to treat our participation as an exploratory effort, aimed at educating ourselves in the issues surrounding the tasks. From that perspective we decided to adopt a fairly naive approach first, see where it went wrong, and then refine the approach as time permitted.

Our first “naive” approach to the entity linking (EL) task was to build a text collection from the textual description portion of the KB nodes, index this collection using a search engine tool, convert the EL query into a search engine query, and return as the answer the top-ranked KB node whose name matched the entity name in the query, provided the similarity score between the query and the KB node exceeded some threshold. Analysis of the failures of this approach suggested that a major problem was the insistence that a KB node name must match the query entity name exactly. The rest of our effort on the EL task went into exploring how we could relax this requirement.

Our first “naive” approach to the slot filling (SF) task was to treat it as a relation extraction task, which we tackled with a rule-based approach, given the shortage of training data and our limited development time and resources. We observed that the majority of slot values were of one of the entity types person, organization, GPE or timex. We therefore chose to run a named entity recognition and classification (NERC) component that identified these entity types over the top-ranked texts retrieved from the test corpus (which had previously been indexed by our search engine tool) using a query derived from the SF query name and associated document.
For each slot, a set of manually developed rules was applied to sentences containing the query entity name and another entity whose type indicated that it was a candidate value for that slot. Candidate entities matched by the rules were returned as slot values. After implementing this approach, little time was left for refinement. The limited time we had was spent analyzing what value of n should be chosen when selecting the top n documents returned in the retrieval stage for subsequent slot extraction.

The rest of this paper describes our approach and related investigations in more detail. Section 2 briefly describes the existing language processing tools that we took “off-the-shelf” to reduce our development time and to allow us to concentrate on the most interesting aspects of the tasks. Sections 3 and 4 describe in detail our approaches to the EL and SF tasks respectively. Section 5 concludes the paper and discusses potential future work.
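
To make the naive EL baseline outlined above concrete, the following is a minimal sketch of its decision rule, not the system's actual implementation. It assumes a bag-of-words cosine similarity as a stand-in for the search engine's ranking score, and the function names, the tuple layout of the KB nodes, and the default threshold are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_entity(query_name, query_text, kb_nodes, threshold=0.1):
    """Return the id of the top-ranked KB node whose name equals the
    query name exactly, or None if no such node scores above threshold.

    kb_nodes is a list of (node_id, node_name, description) tuples;
    this layout is an assumption made for the sketch.
    """
    query_vec = Counter(tokenize(query_text))
    # Rank all KB nodes by similarity of their description to the query text.
    ranked = sorted(
        ((cosine(query_vec, Counter(tokenize(desc))), node_id, name)
         for node_id, name, desc in kb_nodes),
        reverse=True,
    )
    for score, node_id, name in ranked:
        if score <= threshold:
            break  # every remaining node scores at or below the threshold
        if name == query_name:  # the exact-match requirement of the baseline
            return node_id
    return None


# Toy knowledge base, invented purely for illustration.
KB = [
    ("E1", "Sheffield", "Sheffield is a city in South Yorkshire England"),
    ("E2", "Sheffield", "Sheffield is a town in Massachusetts United States"),
    ("E3", "Sheffield United", "football club based in Sheffield England"),
]
CONTEXT = "The university in Sheffield, South Yorkshire, England"
```

In this toy setting, `link_entity("Sheffield", CONTEXT, KB)` selects the node whose description best matches the context, whereas a query name such as "Sheffield Utd" yields no link because no KB node name matches it exactly. The latter case is the kind of failure that motivated relaxing the exact-match requirement.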