Are reported accuracies in the clinical speech machine learning literature overoptimistic?

Building clinical speech analytics models that will reliably translate to the clinic requires a realistic characterization of their performance. So how well does the published literature estimate the accuracy of these models? We evaluate the relationship between sample size and reported accuracy across 77 journal publications that use speech to classify between healthy controls and patients with dementia. The studies are pooled from three meta-analyses that follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol. The results show that reported accuracy declines as sample size increases, with small-sample studies yielding overoptimistic accuracy estimates. For correctly trained models this is unexpected, because the ability of a machine learning model to predict group membership ought to remain the same or improve with additional training data. We posit that the overoptimism results from a combination of publication bias and overfitting, and we suggest mitigation strategies.
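To illustrate how publication bias alone could produce a declining accuracy-versus-sample-size trend, the following is a minimal simulation sketch. It is not the analysis used in the study. The true accuracy (0.70), the publication threshold (0.72), the number of simulated studies, and the function name are all hypothetical values chosen for illustration, and overfitting is not modeled.

```python
import random
import statistics


def simulate_published_accuracy(n_test, true_acc=0.70, publish_threshold=0.72,
                                n_studies=2000, seed=0):
    """Mean accuracy among 'published' simulated studies with n_test test samples.

    Every simulated classifier has the same true accuracy (an assumption).
    A study counts as published only if its observed accuracy reaches
    publish_threshold, a crude stand-in for publication bias.
    """
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        # Observed accuracy is a binomial proportion over n_test test samples.
        correct = sum(rng.random() < true_acc for _ in range(n_test))
        observed = correct / n_test
        if observed >= publish_threshold:
            published.append(observed)
    # Return None if no simulated study cleared the threshold.
    return statistics.mean(published) if published else None


sizes = [20, 50, 100, 200, 500]
results = {n: simulate_published_accuracy(n) for n in sizes}

for n in sizes:
    print(f"n={n:4d}  mean published accuracy={results[n]:.3f}")
```

Because small test sets give noisier accuracy estimates, more of them clear the threshold by chance, and the ones that do overshoot the true accuracy by a wider margin. In this sketch the mean published accuracy is therefore highest for the smallest sample size and moves toward the threshold as the sample size grows, even though the underlying model never changes.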
