A controlled greedy supervised approach for co-reference resolution on clinical text

Identifying co-referent entity mentions in text is important for other natural language processing (NLP) tasks (e.g. event linking). However, this task, known as co-reference resolution, remains a complex problem, partly because of confusion over the different evaluation metrics and partly because well-researched existing methodologies do not perform well on new domains such as clinical records. This paper presents a variant of the influential mention-pair model for co-reference resolution. Using a series of linguistically and semantically motivated constraints, the proposed approach controls the generation of less-informative or sub-optimal training and test instances. The approach also introduces some aggressive greedy strategies in chain clustering. It has been tested on the official test corpus of the i2b2/VA 2011 challenge, where it achieves an unweighted average F1 score of 0.895, calculated from multiple evaluation metrics (MUC, B³ and CEAF scores). These results are comparable to the best systems of the challenge. What makes the proposed system distinct is that it also achieves high average F1 scores for each individual chain type (TEST: 0.897, PERSON: 0.852, PROBLEM: 0.855, TREATMENT: 0.884). Unlike other works, it obtains good scores for each of the individual metrics rather than being biased towards a particular metric.
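The abstract does not give the actual constraints or clustering rules, so the following is only a minimal sketch of the general idea: a mention-pair pipeline in which a constraint filters candidate pairs before classification, and positively classified pairs are merged greedily into chains. The `Mention` type, the same-semantic-type constraint, the `is_coreferent` callback and the union-style merge are all illustrative assumptions, not the system described in the paper. The last helper shows the unweighted average over the three metric F1 scores.

```python
from itertools import combinations
from typing import Callable, Dict, List, NamedTuple, Tuple


class Mention(NamedTuple):
    idx: int        # position of the mention in document order
    text: str
    sem_type: str   # e.g. "PERSON", "PROBLEM", "TEST", "TREATMENT"


def candidate_pairs(mentions: List[Mention]) -> List[Tuple[Mention, Mention]]:
    """Build (antecedent, anaphor) pairs, dropping pairs ruled out by a
    constraint. The constraint here (same semantic type) is an assumption
    chosen for illustration only."""
    return [
        (a, b)
        for a, b in combinations(mentions, 2)
        if a.sem_type == b.sem_type
    ]


def greedy_chains(
    mentions: List[Mention],
    is_coreferent: Callable[[Mention, Mention], bool],
) -> List[List[int]]:
    """Merge every pair the (hypothetical) classifier accepts into one chain."""
    parent: Dict[int, int] = {m.idx: m.idx for m in mentions}

    def find(i: int) -> int:
        # Follow parent links to the chain representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in candidate_pairs(mentions):
        if is_coreferent(a, b):
            parent[find(b.idx)] = find(a.idx)

    chains: Dict[int, List[int]] = {}
    for m in mentions:
        chains.setdefault(find(m.idx), []).append(m.idx)
    # Singletons are not co-reference chains, so they are left out.
    return [chain for chain in chains.values() if len(chain) > 1]


def unweighted_average_f1(muc: float, b_cubed: float, ceaf: float) -> float:
    """Plain arithmetic mean of the three metric F1 scores."""
    return (muc + b_cubed + ceaf) / 3.0
```

With a toy classifier that accepts every pair surviving the filter, mentions of the same semantic type end up in one chain each, and mentions with no partner are left out. A real system would replace `is_coreferent` with a trained pairwise classifier.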
