A Data Mining Approach to the Diagnosis of Tuberculosis by Cascading Clustering and Classification

In this paper, a methodology for the automated detection and classification of Tuberculosis (TB) is presented. Tuberculosis is a disease caused by mycobacteria; it spreads through the air and readily infects people with weakened immune systems. The methodology cascades clustering and classification to assign TB cases to two categories: Pulmonary Tuberculosis (PTB) and retroviral PTB (RPTB), that is, PTB in patients with Human Immunodeficiency Virus (HIV) infection. First, K-means clustering groups the TB data into two clusters, and each cluster is assigned a class. Next, several classification algorithms are trained on the resulting labeled set, and the final classifier model is built using K-fold cross-validation. The methodology is evaluated on 700 raw TB records obtained from a city hospital. Among the classifiers compared, the support vector machine (SVM) achieved the best accuracy, 98.7%. The proposed approach can support doctors in their diagnostic decisions and in planning treatment for the different categories.
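The cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic two-dimensional points rather than the hospital records, the helper names (`kmeans`, `knn_predict`, `kfold_accuracy`) are hypothetical, and a 1-nearest-neighbour classifier stands in for the SVM and the other classifiers the paper compares. It shows only the two stages, cluster-derived labels followed by K-fold cross-validation of a classifier trained on those labels.

```python
import random
from math import dist


def kmeans(points, k=2, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def knn_predict(train_x, train_y, x):
    """1-nearest-neighbour prediction (a stand-in classifier, not the paper's SVM)."""
    nearest = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[nearest]


def kfold_accuracy(points, labels, folds=10, seed=0):
    """Mean accuracy over K folds, holding out each fold once."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test_idx = idx[f::folds]
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_x = [points[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        correct += sum(knn_predict(train_x, train_y, points[i]) == labels[i]
                       for i in test_idx)
    return correct / len(points)


# Synthetic stand-in for the two patient groups (assumption: two numeric features).
rng = random.Random(42)
group_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
group_b = [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(50)]
records = group_a + group_b

# Stage 1: cluster into two groups and treat cluster indices as class labels.
cluster_labels = kmeans(records, k=2)
# Stage 2: cross-validate a classifier trained on those labels.
accuracy = kfold_accuracy(records, cluster_labels, folds=10)
print(f"10-fold accuracy on cluster-derived labels: {accuracy:.3f}")
```

Because the class labels come from the clustering step, the cross-validated accuracy measures how well the classifier reproduces the cluster assignment, which is worth keeping in mind when interpreting a figure such as the reported 98.7%.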
