Correlation and redundancy on machine learning performance for chemical databases

Variable reduction is an essential step in establishing a robust, accurate, and generalizable machine learning model. Variable correlation and redundancy (total correlation) are the primary considerations in many variable reduction methods, because they directly affect model performance. However, their effects vary from one class of database to another. To clarify their effects on regression models built from small chemical databases, a series of calculations is performed. Regression models are trained on feature subsets with varying correlation coefficients and redundancies using 4 machine learning methods: random forest, support vector machine, extreme learning machine, and multiple linear regression. The results suggest that correlation is, as expected, closely related to prediction accuracy; that is, features with large correlation coefficients with respect to the response variable generally yield better regression models than features with small ones. For redundancy, however, no trend in regression model performance is observed. This may indicate that for these chemical molecular databases, redundancy is not a primary concern.
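As a concrete illustration of this experimental design, the minimal Python sketch below ranks features by their absolute Pearson correlation with the response, uses the mean pairwise absolute correlation within the selected subset as a simple redundancy proxy (the paper's redundancy/total-correlation measure may be defined differently), and compares the four regression methods. The synthetic data, the small ELM class, and all parameter values are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                 # stand-in for molecular descriptors
y = X[:, :5] @ rng.normal(size=5) + 0.3 * rng.normal(size=200)

# Rank features by |Pearson r| with the response variable.
r_xy = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
top = np.argsort(r_xy)[::-1][:10]              # keep the 10 most correlated features
X_sel = X[:, top]

# Redundancy proxy: mean pairwise |r| among the selected features.
r_xx = np.abs(np.corrcoef(X_sel, rowvar=False))
redundancy = (r_xx.sum() - len(top)) / (len(top) * (len(top) - 1))
print(f"mean |r(x,y)| of subset: {r_xy[top].mean():.3f}   redundancy: {redundancy:.3f}")

class ELM:
    """Basic extreme learning machine: random hidden layer, least-squares readout."""
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden, self.seed = n_hidden, seed
    def fit(self, X, y):
        r = np.random.default_rng(self.seed)
        self.W = r.normal(size=(X.shape[1], self.n_hidden))
        self.b = r.normal(size=self.n_hidden)
        # Hidden-layer output matrix, solved for the readout weights by pseudoinverse.
        self.beta = np.linalg.pinv(np.tanh(X @ self.W + self.b)) @ y
        return self
    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta

# Compare the four regression methods on the selected feature subset.
Xtr, Xte, ytr, yte = train_test_split(X_sel, y, test_size=0.3, random_state=0)
models = {
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "SVM": SVR(kernel="rbf", C=10.0),
    "ELM": ELM(n_hidden=100),
    "MLR": LinearRegression(),
}
for name, model in models.items():
    model.fit(Xtr, ytr)
    print(f"{name}: test R^2 = {r2_score(yte, model.predict(Xte)):.3f}")

Repeating such runs over subsets chosen for high versus low correlation, or low versus high redundancy, reproduces the kind of comparison the abstract describes: accuracy should track the subset's correlation with the response, while varying the redundancy of equally correlated subsets probes whether redundancy matters for these models.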
