Building a classifier for identifying sentences pertaining to disease-drug relationships in tardive dyskinesia

In this paper, we build a pipeline that identifies and extracts disease-drug relationships via sentence classification, and we demonstrate the feasibility and utility of the approach using tardive dyskinesia as a case study. We manually developed and annotated a biomedical training corpus for tardive dyskinesia. Using 10-fold cross-validation, we trained and tested a naïve Bayes classifier to identify sentences pertaining to disease-drug relationships. Precision, recall, and F-measure were each approximately 66%, and the area under the ROC curve exceeded 80%. Our method helps to elucidate various drug effects on tardive dyskinesia and constitutes an initial effort toward the task of disease-drug relationship extraction.
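The evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a multinomial naive Bayes model with add-one smoothing, whitespace tokenization, and a simple round-robin fold assignment, none of which are specified in the abstract. The sentences, the drug names (`drugA` to `drugE`), and the function names are hypothetical placeholders, not corpus data.

```python
import math
from collections import Counter

# Hypothetical toy sentences (NOT from the paper's corpus).
# Label 1 = pertains to a disease-drug relationship, 0 = does not.
SENTENCES = [
    "drugA increased the risk of tardive dyskinesia",
    "patients were recruited from two clinics",
    "drugB reduced tardive dyskinesia symptoms",
    "the study was approved by the ethics committee",
    "drugC induced tardive dyskinesia in patients",
    "mean age of participants was 45 years",
    "drugD improved tardive dyskinesia severity",
    "data were analyzed using standard software",
    "drugE was associated with tardive dyskinesia",
    "follow up visits occurred every six months",
]
LABELS = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]


def tokenize(sentence):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return sentence.lower().split()


def train_nb(sentences, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for sentence, label in zip(sentences, labels):
        word_counts[label].update(tokenize(sentence))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, sentence):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in sorted(class_counts):
        denom = sum(word_counts[c].values()) + len(vocab)
        score = math.log(class_counts[c] / total)
        for w in tokenize(sentence):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


def cross_validate(sentences, labels, k=10, positive=1):
    """k-fold cross-validation; returns (precision, recall, F-measure)."""
    tp = fp = fn = 0
    n = len(sentences)
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        model = train_nb([sentences[i] for i in train],
                         [labels[i] for i in train])
        for i in test:
            pred = predict_nb(model, sentences[i])
            if pred == positive and labels[i] == positive:
                tp += 1
            elif pred == positive:
                fp += 1
            elif labels[i] == positive:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


if __name__ == "__main__":
    p, r, f = cross_validate(SENTENCES, LABELS, k=5)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Counts are pooled across folds before precision, recall, and F-measure are computed, which is one common convention; averaging per-fold scores is another and may give slightly different numbers. The area under the ROC curve is omitted here because it requires ranking sentences by posterior score rather than thresholded predictions.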