A guideline to determine the training sample size when applying big data mining methods in clinical decision making

Biomedicine is rich in massive data that are heterogeneous, evolving, complex, and unstructured, and that come from autonomous sources (the characteristics summarized by the HACE theorem). Big data mining has become one of the fastest-growing areas of research; it enables the selection, exploration, and modeling of vast amounts of medical data to support clinical decision making, prevent medication errors, and improve patient outcomes. Given the complexity and unstructured nature of biomedical data, it is acknowledged that no single data mining method is best for all applications, so choosing an appropriate process and algorithm is essential for obtaining trustworthy results. To date, however, no guideline exists for this choice, in particular for the training-set sample size needed for reliable results. Sample size is of central importance because biomedical data are expensive: acquiring them takes considerable time and human effort. On the other hand, a small sample can lead to overestimated predictive accuracy through overfitting. The purpose of this paper is to provide a guideline for determining a sample size that yields robust accuracy. Because increasing data volume adds complexity and can significantly affect accuracy, we examined the relationship among sample size, data variation, and the performance of several data mining methods (SVM, Naïve Bayes, logistic regression, and J48) using simulation and two biomedical data sets. The simulation results revealed that sample size can dramatically affect the performance of data mining methods at a given level of data variation, and that this effect is most pronounced in the nonlinear case. For experimental biomedical data, it is therefore essential to examine the impact of sample size and data variation on performance before settling on a sample size.
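The kind of simulation described above can be illustrated with a minimal sketch. It is not the paper's actual protocol: the two-class Gaussian data generator, the nearest-centroid classifier (a simple stand-in, not one of the four methods studied), and all names and parameter values (`simulate`, `learning_curve`, `sigma`, `dim`, `repeats`) are assumptions made for illustration only. The sketch reports accuracy on the training data and on an independent test set as the training sample size and the data variation `sigma` change.

```python
import random


def simulate(n, sigma, rng, dim=5):
    """Return n (features, label) pairs; class k is Gaussian around (k, ..., k).

    Labels alternate 0, 1, 0, 1, ... so n must be at least 2.
    sigma is the hypothetical 'data variation' knob (per-feature std dev).
    """
    return [([rng.gauss(float(i % 2), sigma) for _ in range(dim)], i % 2)
            for i in range(n)]


def fit_centroids(train):
    """Nearest-centroid model: one mean feature vector per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, data):
    """Fraction of points in data that the model labels correctly."""
    return sum(predict(centroids, x) == y for x, y in data) / len(data)


def learning_curve(sizes, sigma, repeats=20, test_size=500, seed=0):
    """Map each training size to (mean training accuracy, mean test accuracy)."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        train_acc = test_acc = 0.0
        for _ in range(repeats):
            train = simulate(n, sigma, rng)
            test = simulate(test_size, sigma, rng)  # independent held-out data
            model = fit_centroids(train)
            train_acc += accuracy(model, train)
            test_acc += accuracy(model, test)
        curve[n] = (train_acc / repeats, test_acc / repeats)
    return curve


if __name__ == "__main__":
    for sigma in (0.5, 2.0):
        for n, (tr, te) in learning_curve([4, 20, 100], sigma).items():
            print(f"sigma={sigma} n={n:4d} train={tr:.3f} test={te:.3f}")
```

Comparing the two accuracy columns across sizes gives a rough view of how far training accuracy overstates held-out accuracy at small sample sizes, and how that gap depends on `sigma`. A nonlinear class boundary would need a different generator and classifier than the ones assumed here.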
