Instance-based classifiers applied to medical databases: Diagnosis and knowledge extraction

OBJECTIVE The aim of this paper is to study the feasibility and performance of several classifier systems belonging to the family of instance-based (IB) learning, both as second-opinion diagnostic tools and as tools for the knowledge extraction phase of knowledge discovery in clinical databases.

MATERIALS AND METHODS We consider three clinical databases: the first concerns the differential diagnosis of erythemato-squamous diseases, the second the diagnosis of the onset of diabetes mellitus, and the third a diagnostic imaging problem in nuclear cardiology. We apply five IB classifiers to each database: two are based on exemplars, one is based on prototypes, and two are hybrid. One of the hybrid classifiers is new, introduced here as the prototype exemplar learning classifier (PEL-C). We use cross-validation to evaluate and compare the performance of the classifiers as diagnostic tools, considering indexes such as accuracy, sensitivity, specificity, and conciseness of class representations. Moreover, we analyze the number and type of instances that represent the diagnostic classes learnt by each classifier, in order to evaluate and compare their knowledge extraction capabilities.

RESULTS The experimental results show that the classifiers with the best classification performance are the optimized k-nearest neighbour classifier (k-NNC) and PEL-C. The k-NNC uses the highest number of representative instances, 100% of the database, whereas PEL-C uses far fewer: on average, 3% of the database. Regarding knowledge extraction, we interpret the class representations obtained by IB classifiers as a form of nosological knowledge. We find the most interesting diagnostic class representations to be those extracted by PEL-C, because they are composed of a mixture of abstracted prototypical cases (syndromes) and selected atypical clinical cases.

CONCLUSION This study shows that IB methods, most notably the optimized k-NNC and PEL-C, can be used in, and may be advantageous for, clinical decision support systems, and that IB classifiers can be used for nosological knowledge extraction. Because PEL-C uses more compact and potentially meaningful class descriptions, it is preferable when the diagnostic problem at hand requires less storage space or when knowledge extraction itself is the goal. The complexity and responsibility of diagnostic practice require that these results be confirmed in other clinical domains.
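To make the evaluation protocol concrete, the following is a minimal sketch of how an exemplar-based classifier (k-NN) and a prototype-based classifier (one class mean per class) could be compared by cross-validation on accuracy, sensitivity and specificity. It is not the PEL-C algorithm, whose details are not given in this abstract. The function names, the fold assignment and the toy two-feature data are all assumptions made for illustration only.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=1):
    """Exemplar-based: majority vote among the k stored instances nearest to query."""
    nearest = sorted(train, key=lambda inst: euclidean(inst[0], query))[:k]
    votes = defaultdict(int)
    for _, label in nearest:
        votes[label] += 1
    return max(votes, key=votes.get)


def class_means(train):
    """Prototype-based: summarise each class by a single mean vector."""
    groups = defaultdict(list)
    for features, label in train:
        groups[label].append(features)
    return [
        (tuple(sum(col) / len(rows) for col in zip(*rows)), label)
        for label, rows in groups.items()
    ]


def nearest_mean_predict(train, query):
    """Assign the label of the closest class mean (a nearest-prototype rule)."""
    return knn_predict(class_means(train), query, k=1)


def cross_validate(data, predict, folds=5, positive=1):
    """Interleaved k-fold cross-validation for a binary problem.

    Returns accuracy, sensitivity (true-positive rate) and
    specificity (true-negative rate) pooled over all folds.
    """
    tp = tn = fp = fn = 0
    for f in range(folds):
        test = data[f::folds]
        train = [inst for i, inst in enumerate(data) if i % folds != f]
        for features, label in test:
            guess = predict(train, features)
            if label == positive:
                if guess == positive:
                    tp += 1
                else:
                    fn += 1
            else:
                if guess == positive:
                    fp += 1
                else:
                    tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical toy data: (features, label), with label 1 as the "disease" class.
data = [
    ((0.0, 0.0), 0), ((5.0, 5.0), 1), ((0.5, 0.2), 0), ((5.2, 4.8), 1),
    ((0.1, 0.9), 0), ((4.7, 5.3), 1), ((0.8, 0.4), 0), ((5.5, 5.1), 1),
    ((0.3, 0.6), 0), ((4.9, 4.6), 1),
]

knn_scores = cross_validate(data, knn_predict)
mean_scores = cross_validate(data, nearest_mean_predict)
print("k-NN:", knn_scores, "stored instances:", len(data))
print("nearest mean:", mean_scores, "stored instances:", len(class_means(data)))
```

In this sketch the k-NN rule keeps every training instance, whereas the nearest-mean rule keeps one prototype per class. That difference in the number of stored instances is the conciseness trade-off the abstract describes between exemplar-based and prototype-based representations.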
