Semantic Analysis and Automatic Corpus Construction for Entailment Recognition in Medical Texts

Recognizing Textual Entailment (RTE) is the task of detecting inference relationships between natural language sentences. It has a wide range of applications, such as machine translation, question answering, and text summarization. RTE has attracted significant interest, with several challenges devoted to it. However, most current approaches target open domains. The main obstacle facing RTE in specialized domains is the lack of relevant training corpora and resources. In this paper, we present an automatic corpus construction approach for RTE in the medical domain. We also quantify the impact of using open-domain and domain-specific RDF datasets on supervised-learning-based RTE. We evaluate the relevance of our corpus construction method by comparing the results obtained by an efficient memory-based learning algorithm on the PASCAL RTE corpora and on our automatically constructed corpus. The results show an accuracy increase of +6% to +28% and an F-measure improvement of +8% to +23%. We also found that semantic annotations from large open-domain datasets increased the F1 score by 6%, while smaller medical RDF datasets actually decreased overall performance. We discuss these findings and give some pointers for future investigations.