A robust data scaling algorithm to improve classification accuracies in biomedical data

Background: Machine learning models have been adapted in biomedical research and practice for knowledge discovery and decision support. While mainstream biomedical informatics research focuses on developing more accurate models, data preprocessing receives less attention. We propose the Generalized Logistic (GL) algorithm, which scales data uniformly to an appropriate interval by learning a generalized logistic function that fits the empirical cumulative distribution function of the data. The GL algorithm is simple yet effective. It is intrinsically robust to outliers, which makes it particularly suitable for diagnostic and classification models in clinical and medical applications, where the number of samples is usually small. It also scales the data nonlinearly, which can improve accuracy.

Results: To evaluate the effectiveness of the proposed algorithm, we conducted experiments on 16 binary classification tasks that involve different variable types and cover a wide range of applications. Measured by area under the receiver operating characteristic curve (AUROC) and percentage of correct classification, models learned from data scaled by the GL algorithm outperformed those learned from data scaled by the Min-max and Z-score algorithms, which are the most commonly used data scaling algorithms.

Conclusion: The proposed GL algorithm is simple and effective. It is robust to outliers, so no additional denoising or outlier detection step is needed in data preprocessing. Empirical results also show that models learned from data scaled by the GL algorithm achieve higher accuracy than models learned from data scaled by the commonly used data scaling algorithms.
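The idea of fitting a generalized logistic function to the empirical cumulative distribution function (ECDF) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact parameterization or the fitting procedure, so the functional form (1 + exp(-b(x - m)))^(-1/nu), the coarse grid search, and the names `fit_gl`, `gl_value`, and `gl_scale` are all hypothetical choices made here.

```python
import math
from bisect import bisect_right


def _softplus(z):
    # Numerically stable log(1 + exp(z)).
    return z if z > 30.0 else math.log1p(math.exp(z))


def gl_value(x, b, m, nu):
    # Assumed generalized logistic form: (1 + exp(-b * (x - m))) ** (-1 / nu).
    return math.exp(-_softplus(-b * (x - m)) / nu)


def fit_gl(train):
    """Grid-search (b, m, nu) minimizing squared error against the ECDF.

    The grid search is an assumption of this sketch; any least-squares
    optimizer could be used in its place.
    """
    xs = sorted(train)
    n = len(xs)
    ecdf = [bisect_right(xs, x) / n for x in xs]  # fraction of samples <= x

    def quantile(p):
        return xs[int(p * (n - 1))]

    spread = (quantile(0.75) - quantile(0.25)) or 1.0   # fall back if IQR is 0
    m_grid = [quantile(k / 20) for k in range(21)]      # candidate locations
    b_grid = [2.0 ** k / spread for k in range(-3, 6)]  # candidate growth rates
    nu_grid = [0.25, 0.5, 1.0, 2.0, 4.0]                # candidate shape values

    best, best_err = None, float("inf")
    for b in b_grid:
        for m in m_grid:
            for nu in nu_grid:
                err = sum((gl_value(x, b, m, nu) - f) ** 2
                          for x, f in zip(xs, ecdf))
                if err < best_err:
                    best, best_err = (b, m, nu), err
    return best


def gl_scale(values, params):
    # Map each value into [0, 1] through the fitted curve.
    b, m, nu = params
    return [gl_value(v, b, m, nu) for v in values]


# Made-up sample with one extreme outlier (illustrative data only).
train = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.3, 6.8, 7.4, 250.0]
params = fit_gl(train)
scaled = gl_scale(train, params)
```

Because the squared error is measured on the bounded ECDF axis, the outlier at 250.0 contributes at most a bounded error and cannot stretch the fitted curve. The nine ordinary values therefore stay spread across the unit interval, whereas Min-max scaling of the same sample would compress them into roughly [0, 0.014].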
