Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data

Newborn screening programs for severe metabolic disorders using tandem mass spectrometry are widely used. Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) is the most prevalent mitochondrial fatty acid oxidation defect (1:15,000 newborns), and early detection of this metabolic disease has been shown to decrease mortality and improve outcome. In previous studies, data mining methods applied to derivatized tandem MS datasets have shown high classification accuracies. However, no machine learning methods have yet been applied to datasets based on non-derivatized screening methods. A dataset of 44,159 blood samples was collected using a non-derivatized screening method as part of systematic newborn screening by the PCMA screening center (Belgium). This partially MCADD-enriched dataset contained twelve MCADD cases. We extended three data mining methods, namely C4.5 decision trees, logistic regression and ridge logistic regression, with a parameter and threshold optimization method and evaluated their applicability as a diagnostic support tool. Within a stratified cross-validation setting, a grid search was performed for each model over a wide range of model parameters, included variables and classification thresholds. The best performing model used ridge logistic regression and achieved, in stratified cross-validation, a sensitivity of 100%, a specificity of 99.987% and a positive predictive value of 32% (recalibrated for a real population). These results were further validated on an independent test set. Using a method that combines ridge logistic regression with variable selection and threshold optimization, a significantly improved performance was achieved compared to the current state-of-the-art for derivatized data, while retaining more interpretability and requiring fewer variables. The results indicate the potential value of data mining methods as a diagnostic support tool.
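The recalibration of the positive predictive value to a real population can be illustrated with Bayes' rule. The following is a minimal sketch, not the authors' procedure: the function name `recalibrated_ppv` is hypothetical, and the inputs are the rounded figures quoted in the abstract (sensitivity 100%, specificity 99.987%) together with the assumption that the real-population prevalence equals the quoted 1:15,000. Because the unrounded specificity and the prevalence actually used are not given here, the result is close to, but not identical to, the reported 32%.

```python
def recalibrated_ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value for a population with the given prevalence (Bayes' rule)."""
    # Expected fraction of the population that is affected and flagged positive.
    true_pos = sensitivity * prevalence
    # Expected fraction of the population that is unaffected but flagged positive.
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Rounded values from the abstract; the prevalence of 1:15,000 is an assumption.
ppv = recalibrated_ppv(sensitivity=1.0, specificity=0.99987, prevalence=1 / 15000)
print(f"{ppv:.3f}")  # 0.339
```

The sketch shows why the positive predictive value stays moderate despite a very high specificity: at a prevalence of 1:15,000, even 0.013% false positives among unaffected newborns outnumber the true positives roughly two to one.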
