One Size Does Not Fit All: An Ensemble Approach Towards Information Extraction from Adverse Drug Event Narratives

Recognizing named entities in adverse drug reaction narratives is a fundamental step towards extracting valuable patient information from unstructured text into a structured, and thus actionable, format. This in turn enables advanced data analytics for intelligent pharmacovigilance. Yet existing biomedical named entity recognition (NER) tools are limited in their ability to identify certain entity types in these domain-specific narratives, and their accuracy varies significantly. To address these challenges, we propose an ensemble approach that integrates a rich variety of named entity recognizers to produce the final result. First, one critical problem faced by NER in the biomedical context is that the data is highly skewed: only 1% of words belong to a given medical entity type, such as the reason for medication usage, compared with all other (non-reason) words. To overcome this class imbalance problem, we propose a balanced, under-sampled bagging strategy that depends on the imbalance level. Second, we present an ensemble of heterogeneous recognizers that leverages a novel ensemble combiner. Our experimental results show that, for biomedical text datasets, (i) a balanced learning environment combined with an ensemble of heterogeneous classifiers consistently improves performance over individual base learners, and (ii) stacking-based ensemble combiner methods outperform simple majority voting by 0.30 F-measure.
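
The two ideas named in the abstract, balanced under-sampled bagging and a majority-voting baseline combiner, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the "O" label for non-entity tokens, and the `ratio` parameter (standing in for the dependence on the imbalance level) are assumptions made for illustration only.

```python
import random
from collections import Counter


def balanced_bags(labels, n_bags, majority_label="O", ratio=1.0, seed=0):
    """Build n_bags training subsets as lists of token indices.

    Each bag keeps every minority-class token and a random sample of
    majority-class tokens. `ratio` (hypothetical knob) sets how many
    majority tokens are drawn per minority token.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    # Number of majority tokens to keep, capped at what is available.
    k = min(len(majority), max(1, round(ratio * len(minority))))
    return [sorted(minority + rng.sample(majority, k)) for _ in range(n_bags)]


def majority_vote(predictions):
    """Combine per-recognizer label sequences by a token-level vote.

    `predictions` is a list of equally long label sequences, one per
    base recognizer. Ties go to the label seen first for that token.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


if __name__ == "__main__":
    # 1 "REASON" token among 100, mirroring the 1% skew in the abstract.
    labels = ["O"] * 99 + ["REASON"]
    for bag in balanced_bags(labels, n_bags=3):
        print(bag, [labels[i] for i in bag])
    print(majority_vote([["O", "REASON"], ["O", "REASON"], ["REASON", "O"]]))
```

In this sketch each bag would train one base recognizer on a roughly balanced subset. A stacking combiner, which the abstract reports as the stronger option, would replace `majority_vote` with a meta-learner trained on the base recognizers' outputs.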
