The impact of feature selection on medical document classification

Medical document classification remains a popular research problem in the text classification domain. Apart from text data compiled from hospital records, most researchers in this field evaluate their classification methodologies on documents retrieved from the MEDLINE database. OHSUMED is a widely used dataset of multi-labeled MEDLINE documents. In this study, the impact of feature selection on medical document classification is analyzed using two datasets containing MEDLINE documents. The performance of two feature selection methods, Gini Index and Distinguishing Feature Selector, is analyzed using two pattern classifiers: Bayesian network and C4.5 decision tree. Because this study deals with single-label classification, a subset of the OHSUMED documents and a self-compiled dataset are used to assess the feature selection methods. Because some categories in the self-compiled dataset contain few documents, only documents belonging to 10 disease categories are used in the experiments for both datasets. Experimental results show that the combination of Distinguishing Feature Selector and the Bayesian network classifier gives more accurate results than the other combinations in most cases.
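
As a rough illustration of how the two feature selection methods score a term, the sketch below computes both scores from document frequencies. It assumes the formulations commonly given in the literature: Distinguishing Feature Selector as the sum over classes of P(C|t) / (P(not t|C) + P(t|not C) + 1), and the improved Gini Index as the sum over classes of P(t|C)^2 * P(C|t)^2. The function names and the toy documents are hypothetical and are not taken from the study, which may differ in preprocessing and implementation details.

```python
from collections import Counter, defaultdict


def term_class_stats(docs, labels):
    """Count documents per class and per (term, class) pair."""
    n_class = Counter(labels)      # n_class[c]  = number of documents in class c
    df = defaultdict(Counter)      # df[t][c]    = documents of class c containing t
    for tokens, label in zip(docs, labels):
        for term in set(tokens):   # presence/absence, not term frequency
            df[term][label] += 1
    return n_class, df


def dfs_score(term, n_class, df):
    """Distinguishing Feature Selector score (assumes at least two classes)."""
    counts = df.get(term, Counter())
    n_term = sum(counts.values())  # documents containing the term
    if n_term == 0:
        return 0.0
    n_docs = sum(n_class.values())
    score = 0.0
    for c, n_c in n_class.items():
        n_tc = counts[c]
        p_c_given_t = n_tc / n_term
        p_not_t_given_c = (n_c - n_tc) / n_c
        p_t_given_not_c = (n_term - n_tc) / (n_docs - n_c)
        score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1.0)
    return score


def gini_score(term, n_class, df):
    """Improved Gini Index score for text features."""
    counts = df.get(term, Counter())
    n_term = sum(counts.values())
    if n_term == 0:
        return 0.0
    score = 0.0
    for c, n_c in n_class.items():
        p_t_given_c = counts[c] / n_c
        p_c_given_t = counts[c] / n_term
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score


def top_k_terms(score_fn, n_class, df, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(df, key=lambda t: (-score_fn(t, n_class, df), t))
    return ranked[:k]


# Toy, made-up corpus: two classes, one term shared by every document.
docs = [
    ["patient", "fever", "cough"],
    ["patient", "fever", "rash"],
    ["patient", "fracture", "pain"],
    ["patient", "fracture", "cast"],
]
labels = ["flu", "flu", "ortho", "ortho"]

n_class, df = term_class_stats(docs, labels)
for term in ("fever", "cough", "patient"):
    print(term, dfs_score(term, n_class, df), gini_score(term, n_class, df))
```

In this toy corpus, a term that appears in every document of exactly one class ("fever") gets the highest score under both methods, while a term that appears in every document regardless of class ("patient") gets the lowest Distinguishing Feature Selector score. The selected top-k terms would then form the feature set passed to a classifier such as a Bayesian network or C4.5 decision tree.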
