Classification of complete blood count and haemoglobin typing data by a C4.5 decision tree, a naïve Bayes classifier and a multilayer perceptron for thalassaemia screening

Abstract: This article presents the classification of blood characteristics by a C4.5 decision tree, a naïve Bayes classifier and a multilayer perceptron for thalassaemia screening. The aim is to classify eighteen classes of thalassaemia abnormality that are highly prevalent in Thailand, plus one control class, from data characterised by a complete blood count (CBC) and haemoglobin typing. Two indices, haemoglobin concentration (HB) and mean corpuscular volume (MCV), are the chosen CBC attributes. The chosen haemoglobin typing attributes are the known types of haemoglobin from six ranges of retention time identified via high performance liquid chromatography (HPLC). Stratified 10-fold cross-validation results indicate that the best classification performance is achieved when the naïve Bayes classifier and the multilayer perceptron are applied to samples that have been pre-processed by attribute discretisation, with average accuracies of 93.23% (standard deviation = 1.67%) and 92.60% (standard deviation = 1.75%), respectively. The results also suggest that the HB attribute is redundant. Moreover, the achieved classification performance is significantly higher than that obtained using only haemoglobin typing attributes as classifier inputs. The naïve Bayes classifier and the multilayer perceptron are subsequently applied to an additional data set in a clinical trial, yielding accuracies of 99.39% and 99.71%, respectively. These results suggest that combining CBC and haemoglobin typing analysis with a naïve Bayes classifier or a multilayer perceptron is highly suitable for automatic thalassaemia screening.
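
The evaluation protocol named in the abstract, stratified 10-fold cross-validation of a naïve Bayes classifier on discretised attributes, can be sketched in plain Python. This is only an illustrative sketch and not the authors' implementation: the attribute bins (`mcv_bin`, `peak_bin`), the three class labels and the toy samples are all made up for the example, and Laplace smoothing is an assumed choice for the probability estimates.

```python
import math
import random
from collections import Counter, defaultdict
from statistics import mean, stdev


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        # Deal each class round-robin so every fold gets its share.
        for j, i in enumerate(idx):
            folds[(offset + j) % k].append(i)
        offset += len(idx)
    return folds


class CategoricalNB:
    """Naive Bayes over discrete attribute values (Laplace smoothing assumed)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_count = Counter(y)
        self.total = len(y)
        # Distinct values seen per attribute, used in the smoothing denominator.
        self.values = [{row[a] for row in X} for a in range(len(X[0]))]
        self.counts = Counter()  # (class, attribute index, value) -> count
        for row, c in zip(X, y):
            for a, v in enumerate(row):
                self.counts[(c, a, v)] += 1
        return self

    def predict_one(self, row):
        best, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus the sum of log conditional probabilities.
            score = math.log(self.class_count[c] / self.total)
            for a, v in enumerate(row):
                num = self.counts[(c, a, v)] + 1
                den = self.class_count[c] + len(self.values[a])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = c, score
        return best


def cross_validate(X, y, k=10, seed=0):
    """Return the per-fold accuracies of CategoricalNB under stratified k-fold CV."""
    folds = stratified_folds(y, k, seed)
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = CategoricalNB().fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(model.predict_one(X[i]) == y[i] for i in fold)
        accuracies.append(hits / len(fold))
    return accuracies


# Hypothetical, noise-free toy data: (mcv_bin, peak_bin) -> class label.
prototypes = {
    "control": ("normal", "none"),
    "trait_a": ("low", "A2_high"),
    "trait_b": ("low", "E_present"),
}
X, y = [], []
for label, attrs in prototypes.items():
    X.extend([attrs] * 40)
    y.extend([label] * 40)

folds = stratified_folds(y, 10)
accs = cross_validate(X, y, k=10)
print(f"mean accuracy = {mean(accs):.4f}, standard deviation = {stdev(accs):.4f}")
```

Reporting the mean and standard deviation over the ten folds mirrors how the abstract states its cross-validation accuracies. Real CBC and HPLC attributes would first need a discretisation step, which this sketch skips by starting from already binned values.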
