Ensemble based optimal classification model for pre-diagnosis of lung cancer

Lung cancer is one of a group of malignant diseases affecting the lung and associated organs. Pre-diagnosis is an important stage for identifying the target group of persons who should proceed to the diagnosis stage. In this study, a model based on an ensemble of classifiers is proposed for predicting lung cancer from symptoms and risk factors. A data mining approach is adopted to develop the model. Data were collected from medically confirmed and diagnosed patient cases. The collected data are passed through a data acceptance procedure that eliminates outliers and removes insignificant data and noise. Data approved at this stage are pre-processed using a multi-filter approach. The pre-processed data are then fed to classifier algorithms based on rule, logic, conditional probability and neural network approaches. Performance parameters and confusion matrices are obtained for the individual algorithms under both cross-validation and training-set evaluation. Classification algorithms are shortlisted based on Receiver Operating Characteristic (ROC) performance, error statistics and the confusion matrix. It was observed that training-set evaluation generally gave better performance than cross-validation. Based on the error statistics, a refinement process is carried out, which further reduces the number of classifiers. In this study, the Sequential Minimal Optimization, Multi-Layer Perceptron, instance-based learning with k-nearest neighbours, Logistic, Random Forest, Multiclass Classifier, LogitBoost and Random Tree classifier algorithms gave consistently better performance than the others. Feature subset selection is then carried out using the Correlation-based Feature Selection (CFS) subset selection method under different search criteria, to reduce the dimension of the attributes.
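The CFS step can be illustrated with a minimal sketch. It assumes the commonly cited CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-to-class correlation and r_ff the mean feature-to-feature correlation of a k-feature subset. Plain Pearson correlation and a greedy forward search are used here as simplifying assumptions; the study's actual correlation measure and search criteria are not specified in this abstract. The feature names and toy values below are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cfs_merit(columns, labels, subset):
    """CFS merit of a subset: rewards class correlation, penalises redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = mean(abs(pearson(columns[f], labels)) for f in subset)
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = mean(abs(pearson(columns[a], columns[b])) for a, b in pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward_cfs(columns, labels, min_gain=1e-9):
    """Add one feature at a time while the merit improves by more than min_gain."""
    selected, best = [], 0.0
    remaining = list(columns)
    while remaining:
        candidate = max(
            remaining, key=lambda f: cfs_merit(columns, labels, selected + [f])
        )
        score = cfs_merit(columns, labels, selected + [candidate])
        if score <= best + min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = score
    return selected, best


# Hypothetical toy data: one informative symptom, an exact duplicate of it,
# and one attribute that is uncorrelated with the class label.
labels = [0, 0, 0, 1, 1, 1, 0, 1]
columns = {
    "haemoptysis": [0, 0, 1, 1, 1, 1, 0, 1],
    "haemoptysis_copy": [0, 0, 1, 1, 1, 1, 0, 1],
    "noise": [1, 0, 1, 1, 1, 0, 0, 0],
}

selected, merit = greedy_forward_cfs(columns, labels)
print(selected, round(merit, 4))
```

In this toy case the duplicate column adds no merit because it is fully redundant with the feature already chosen, and the uncorrelated column lowers the merit, so the search stops after one feature.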
Feature set selection reduced the dimensionality from 76 dimensions to 20. An optimal model is developed as an ensemble of classifier algorithms under a supervised training approach. The model's outcome class labels are validated only if all the prediction classifiers give the same result. Some of the salient features observed in this study are: unintentional weight loss, pain in parts of the body, specific symptoms of lung cancer [coughing up blood (haemoptysis) or bloody mucus; chest, shoulder, or back pain; increase in volume of sputum; wheezing; shortness of breath] and risk factors such as age at the time of diagnosis, beedi smoking, consumption of country liquor/toddy, consumption of brandy, prolonged exposure to sunlight and close relatives suffering from cancer. These features played a major role in the prediction of the outcome class label.
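The unanimity rule described above, where a class label is accepted only when every classifier in the ensemble agrees, can be sketched as follows. This is an illustrative outline under stated assumptions, not the study's implementation: the three rule functions are hypothetical placeholders standing in for trained classifiers, and the record fields and thresholds are invented for the example and carry no clinical meaning.

```python
from typing import Callable, Dict, Optional, Sequence

Record = Dict[str, object]
Classifier = Callable[[Record], str]


def unanimous_predict(
    classifiers: Sequence[Classifier], record: Record
) -> Optional[str]:
    """Return the shared label if all classifiers agree, otherwise None."""
    votes = {clf(record) for clf in classifiers}
    return votes.pop() if len(votes) == 1 else None


# Hypothetical stand-ins for trained models (toy rules, assumptions only).
def symptom_rule(r: Record) -> str:
    return "positive" if r["haemoptysis"] or r["weight_loss"] else "negative"


def risk_rule(r: Record) -> str:
    return "positive" if r["age"] >= 50 and r["smoker"] else "negative"


def breath_rule(r: Record) -> str:
    return "positive" if r["shortness_of_breath"] else "negative"


ensemble = [symptom_rule, risk_rule, breath_rule]

agreeing_case = {
    "haemoptysis": True, "weight_loss": True, "age": 62,
    "smoker": True, "shortness_of_breath": True,
}
split_case = {
    "haemoptysis": True, "weight_loss": False, "age": 35,
    "smoker": False, "shortness_of_breath": False,
}

print(unanimous_predict(ensemble, agreeing_case))
print(unanimous_predict(ensemble, split_case))
```

Returning `None` on disagreement leaves the case unvalidated rather than forcing a majority label, which matches the conservative rule stated in the abstract at the cost of leaving some cases without a prediction.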
