Optimization enhanced genetic algorithm-support vector regression for the prediction of compound retention indices in gas chromatography

A new method that couples a genetic algorithm with support vector regression (SVR) and parameter optimization (GASVRPO) was developed to predict compound retention indices (RI) in gas chromatography. The dataset consists of 252 compounds extracted from the Molecular Operating Environment (MOE) boiling point database. Molecular descriptors were calculated with the descriptor tools of the MOE software package; after redundant descriptors were removed, 151 descriptors remained for each compound. A genetic algorithm (GA) was used to select both the best subset of molecular descriptors and the best SVR parameters, so as to optimize retention index prediction. Prediction performance was evaluated by 10-fold cross-validation. The proposed model was compared with three existing methods: GA coupled with multiple linear regression (GAMLR), SVR trained on the descriptor subset selected by GAMLR (GAMLRSVR), and GA applied to SVR (GASVR). The experimental results show that GASVRPO predicts better than the other models, with R² > 0.967 and RMSE = 49.94. The prediction accuracy of the GASVRPO model is 96% at 10% prediction variation.
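The idea of letting one GA chromosome carry both the descriptor subset and the SVR parameters can be illustrated with a minimal, standard-library-only sketch. Everything beyond the counts stated in the abstract (151 descriptors, 252 compounds, 10 folds) is an assumption made for illustration: the 8-bit parameter encoding, the log2 search ranges for C, gamma and epsilon, the GA operators (tournament selection, one-point crossover, bit-flip mutation, elitism), and the reading of "accuracy at 10% prediction variation" as the fraction of predictions within 10% of the observed value. The fitness function is left pluggable; in a real run it would presumably be the cross-validated RMSE of an SVR trained on the decoded descriptors and parameters.

```python
import math
import random

N_DESCRIPTORS = 151   # descriptor count stated in the abstract
PARAM_BITS = 8        # assumed resolution per SVR parameter
# Assumed log2 search ranges; the abstract does not give the actual ranges.
PARAM_RANGES = {"C": (-5.0, 15.0), "gamma": (-15.0, 3.0), "epsilon": (-8.0, 0.0)}
CHROM_LEN = N_DESCRIPTORS + PARAM_BITS * len(PARAM_RANGES)


def decode(chrom):
    """Split a bit list into (selected descriptor indices, SVR parameter dict)."""
    selected = [i for i in range(N_DESCRIPTORS) if chrom[i]]
    params, pos = {}, N_DESCRIPTORS
    for name, (lo, hi) in PARAM_RANGES.items():
        bits = chrom[pos:pos + PARAM_BITS]
        level = int("".join(map(str, bits)), 2) / (2 ** PARAM_BITS - 1)
        params[name] = 2.0 ** (lo + level * (hi - lo))
        pos += PARAM_BITS
    return selected, params


def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def within_tolerance(y_true, y_pred, tol=0.10):
    """Fraction of predictions whose relative error is at most tol (assumed metric)."""
    hits = sum(abs(p - t) <= tol * abs(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def evolve(fitness, pop_size=30, generations=20, p_mut=0.01, seed=0):
    """Minimal GA that minimises fitness(chromosome) and returns the best chromosome."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: the current best survives unchanged
        while len(children) < pop_size:
            a = min(rng.sample(pop, 3), key=fitness)  # tournament of size 3
            b = min(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, CHROM_LEN)         # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = children
        best = min(pop, key=fitness)
    return best


if __name__ == "__main__":
    # Placeholder fitness (number of selected descriptors), NOT the paper's objective.
    best = evolve(lambda c: sum(c[:N_DESCRIPTORS]), pop_size=20, generations=10)
    selected, params = decode(best)
    print(len(selected), params)
    print([len(f) for f in kfold_indices(252)])
```

Encoding the parameters on a log2 scale is a common choice for SVR tuning because useful values of C and gamma span several orders of magnitude. Putting the descriptor mask and the parameters in one chromosome lets the GA search both jointly instead of tuning the parameters after the subset is fixed.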
