A methodology for the design of experiments in computational intelligence with multiple regression models

The design of experiments and the validation of their results are vital in any research study. This paper focuses on the use of different Machine Learning approaches for regression tasks in the field of Computational Intelligence, and especially on the correct comparison of the results produced by different methods, since these techniques are complex systems that require further study to be fully understood. A methodology commonly accepted in Computational Intelligence is implemented in an R package called RRegrs, which includes ten simple and complex regression models for predictive modeling with well-known Machine Learning regression algorithms. The framework for experimental design presented herein is evaluated and validated against RRegrs. Our results differ for three out of five state-of-the-art simple datasets, and the selection of the best model according to our proposal is statistically significant and relevant. With this kind of algorithm, it is essential to use a statistical approach to determine whether the differences between methods are statistically significant. Furthermore, on three real complex datasets our methodology reports best models different from those obtained with the previously published methodology. Our final goal is to provide a complete, step-by-step methodology for comparing the results obtained in Computational Intelligence problems, as well as in other fields such as bioinformatics and cheminformatics, given that our proposal is open and modifiable.
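The core idea of the comparison step can be sketched in a few lines. This is not the authors' RRegrs implementation: it is a minimal illustration, assuming a scikit-learn-style workflow, of collecting paired per-fold errors for several regression models and applying a nonparametric Friedman test before declaring one model "best". The dataset and the four models chosen are purely illustrative.

```python
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
models = {
    "lm": LinearRegression(),
    "enet": ElasticNet(alpha=0.1),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": SVR(),
}
# Identical folds for every model, so the per-fold errors are paired samples.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# One RMSE per model per fold (sklearn reports negated errors, hence the minus).
scores = {
    name: -cross_val_score(m, X, y, cv=cv,
                           scoring="neg_root_mean_squared_error")
    for name, m in models.items()
}

# Friedman test: are the models' per-fold rankings significantly different?
stat, p = friedmanchisquare(*scores.values())
print({k: round(v.mean(), 2) for k, v in scores.items()})
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")
```

If the Friedman test rejects the null hypothesis, a post-hoc procedure (e.g. a Holm-corrected pairwise comparison) would follow to identify which model actually differs; picking the lowest mean RMSE alone, without such a test, is exactly the practice the paper argues against.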
