Best practices for supervised machine learning when examining biomarkers in clinical populations

Abstract

Machine learning approaches are increasingly used in health research, with applications ranging from identifying disease onset and classifying disease severity to predicting epileptic seizures. Although machine learning can be a powerful tool, it is open to misuse: overfitting can inflate model performance, and an overfitted model will not generalize to the wider population. The risk of misuse increases when the number of variables that can be extracted from continuous data is almost unlimited, as is the case for neural, movement, and acoustic (e.g., speech and music) data. Because health research often relies on small samples, and outcome variables can be noisier in clinical populations, several points should be considered before using machine learning. We suggest best practices in machine learning covering data formatting, dimensionality reduction, model selection and evaluation, and other steps in the machine learning process. We also discuss common pitfalls in applying machine learning to small samples and high-dimensional data (e.g., speech biomarkers, neural and imaging data). We advocate parsimonious approaches: selecting the simplest machine learning method that best describes the data, preventing redundancy and overfitting through variable elimination, and ensuring that particular variables or approaches do not inflate machine learning outcomes. We further consider approaches that can identify the best predictors (or combinations thereof), as well as “black box” machine learning methods (e.g., deep learning). Finally, we discuss the limitations of current machine learning methods and propose future directions to broaden the applicability of machine learning tools and to make their outcomes robust against random factors.
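The inflation of model performance in small, high-dimensional samples can be illustrated with a minimal sketch. It is not taken from the paper: the scikit-learn pipeline, the synthetic noise data, and the name `compare_selection_strategies` are all assumptions chosen for illustration. The labels carry no signal, so any classifier should score near chance. Selecting variables on the full data set before cross-validation nonetheless tends to report a much higher score than selecting them inside each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_selection_strategies(n_samples=40, n_features=2000, k=20, seed=0):
    """Return (leaky, honest) mean balanced accuracy on pure-noise data.

    Hypothetical helper for illustration only; all sizes are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Pure noise: no variable is truly related to the class label.
    X = rng.standard_normal((n_samples, n_features))
    y = np.repeat([0, 1], n_samples // 2)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Leaky: variables are chosen using ALL samples, including those
    # that later serve as held-out test folds.
    X_leaky = SelectKBest(f_classif, k=k).fit_transform(X, y)
    leaky_model = make_pipeline(StandardScaler(),
                                LogisticRegression(max_iter=1000))
    leaky = cross_val_score(leaky_model, X_leaky, y, cv=cv,
                            scoring="balanced_accuracy").mean()

    # Honest: selection is a pipeline step, so it is refit on the
    # training portion of every fold and never sees the test fold.
    honest_model = make_pipeline(StandardScaler(),
                                 SelectKBest(f_classif, k=k),
                                 LogisticRegression(max_iter=1000))
    honest = cross_val_score(honest_model, X, y, cv=cv,
                             scoring="balanced_accuracy").mean()
    return leaky, honest


if __name__ == "__main__":
    leaky, honest = compare_selection_strategies()
    print(f"selection before CV: {leaky:.2f}")
    print(f"selection inside CV: {honest:.2f}")
```

Because the data contain no real effect, the gap between the two scores is entirely an artifact of where variable selection happens. The same reasoning applies to any data-dependent step (scaling, dimensionality reduction, hyperparameter tuning) that is fitted outside the cross-validation loop.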
