Using Machine Learning to Aid the Interpretation of Urine Steroid Profiles.

BACKGROUND Urine steroid profiles are used in clinical practice for the diagnosis and monitoring of disorders of steroidogenesis and adrenal pathologies. Machine learning (ML) algorithms are powerful computational tools used extensively for the recognition of patterns in large data sets. Here, we investigated the utility of various ML algorithms for the automated biochemical interpretation of urine steroid profiles to support current clinical practices.

METHODS Data from 4619 urine steroid profiles processed between June 2012 and October 2016 were retrospectively collected. Of these, 1314 profiles were used to train and test various ML classifiers' abilities to differentiate between "No significant abnormality" and "?Abnormal" profiles. Further classifiers were trained and tested for their ability to predict the specific biochemical interpretation of the profiles.

RESULTS The best performing binary classifier could predict the interpretation of No significant abnormality and ?Abnormal profiles with a mean area under the ROC curve of 0.955 (95% CI, 0.949-0.961). In addition, the best performing multiclass classifier could predict the individual abnormal profile interpretation with a mean balanced accuracy of 0.873 (0.865-0.880).

CONCLUSIONS Here we have described the application of ML algorithms to the automated interpretation of urine steroid profiles. This provides a proof-of-concept application of ML algorithms to complex clinical laboratory data that has the potential to improve laboratory efficiency in a setting of limited staff resources.
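The two metrics reported above, area under the ROC curve and balanced accuracy, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, labels, and scores below are hypothetical and chosen only to show how each metric is defined. Balanced accuracy is taken here as the mean of per-class recall, and the AUC is computed as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; works for binary and multiclass labels."""
    total = defaultdict(int)    # number of true cases per class
    correct = defaultdict(int)  # number of correctly predicted cases per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def roc_auc(y_true, scores, positive):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute an AUC")
    # Count pairwise wins for the positive class; ties contribute 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not taken from the study).
y_true = ["normal", "normal", "normal", "normal", "abnormal", "abnormal"]
y_pred = ["normal", "normal", "normal", "abnormal", "abnormal", "normal"]
# Hypothetical classifier scores for the "abnormal" class.
scores = [0.1, 0.2, 0.3, 0.7, 0.9, 0.4]

print(balanced_accuracy(y_true, y_pred))     # recall 3/4 and 1/2, mean 0.625
print(roc_auc(y_true, scores, "abnormal"))   # 7 of 8 pairs ranked correctly, 0.875
```

Averaging recall over classes keeps a rare interpretation from being swamped by a common one, which is presumably why balanced accuracy rather than plain accuracy is reported for the multiclass task.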
