Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules

OBJECTIVE This paper describes the coreference resolution system submitted by Mayo Clinic for Track 1C of the 2011 i2b2/VA/Cincinnati shared task. The goal of the task was to build a system that links the markables referring to the same entity.

MATERIALS AND METHODS The task organizers provided progress notes and discharge summaries annotated with markables of five types: treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in decreasing order of precision while gathering information about the entities in each document. Our system, MedCoref, can also use a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve.

RESULTS The best multi-pass sieve system achieved an overall score (the average of the B³, MUC, BLANC, and CEAF F scores) of 0.836 on the training set and 0.843 on the test set.

DISCUSSION A supervised machine learning system, which typically uses a single function to find coreferents, cannot accommodate irregularities in the data, especially when training examples are scarce. Conversely, a completely deterministic system can lose recall (sensitivity) when its rules are not exhaustive. The sieve-based framework allows reliable machine learning components to be combined with rules designed by experts.

CONCLUSION An effective coreference resolution system can be designed using relatively simple rules, part-of-speech information, and semantic type properties. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.
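
The multi-pass sieve idea can be illustrated with a minimal Python sketch. It is not MedCoref's implementation: the `Markable` class, the three toy rules (`exact_match`, `head_match`, `pronoun_sieve`), and the example note are all hypothetical, chosen only to show how sieves ordered from most to least precise keep adding links to the same shared set of entity clusters.

```python
from dataclasses import dataclass


@dataclass
class Markable:
    text: str
    sem_type: str  # "treatment", "problem", "test", "person", or "pronoun"


class Clusters:
    """Union-find over markable indices; each set stands for one entity."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[max(ri, rj)] = min(ri, rj)

    def chains(self):
        groups = {}
        for i in range(len(self.parent)):
            groups.setdefault(self.find(i), []).append(i)
        # Report only entities that have at least two mentions.
        return [g for _, g in sorted(groups.items()) if len(g) > 1]


def pairwise_sieve(rule):
    """Turn a pairwise rule into a pass over all (antecedent, mention) pairs."""
    def sieve(markables, clusters):
        for j, mention in enumerate(markables):
            for i in range(j):
                if rule(markables[i], mention):
                    clusters.union(i, j)
    return sieve


def exact_match(a, b):
    # Hypothetical high-precision rule: same type, identical string.
    return (a.sem_type == b.sem_type and a.sem_type != "pronoun"
            and a.text.lower() == b.text.lower())


def head_match(a, b):
    # Hypothetical looser rule: same type, same final token.
    return (a.sem_type == b.sem_type and a.sem_type != "pronoun"
            and a.text.lower().split()[-1] == b.text.lower().split()[-1])


def pronoun_sieve(markables, clusters):
    # Hypothetical final pass: link a personal pronoun to the nearest
    # preceding person markable.
    personal = {"he", "she", "him", "her", "his"}
    for j, mention in enumerate(markables):
        if mention.sem_type != "pronoun" or mention.text.lower() not in personal:
            continue
        for i in range(j - 1, -1, -1):
            if markables[i].sem_type == "person":
                clusters.union(i, j)
                break


def resolve(markables):
    clusters = Clusters(len(markables))
    # Sieves run in decreasing order of precision; later passes only add links.
    for sieve in (pairwise_sieve(exact_match), pairwise_sieve(head_match),
                  pronoun_sieve):
        sieve(markables, clusters)
    return clusters.chains()


note = [
    Markable("the patient", "person"),
    Markable("chest pain", "problem"),
    Markable("He", "pronoun"),
    Markable("the chest pain", "problem"),
]
print(resolve(note))
```

In this toy note the exact-match pass finds nothing, the looser head-match pass links the two "chest pain" mentions, and the pronoun pass links "He" to "the patient". Because the pronoun pass is an ordinary function with the same signature as the others, it could be swapped for a learned component, which is the kind of substitution the abstract describes for the final sieve.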
