Gas Chromatographic Retention Index Prediction Using Multimodal Machine Learning

Gas chromatography is a widely used method in analytical chemistry and metabolomics. It separates vaporizable compounds so that they can then be identified. Retention indices are standardized values that characterize the retention of a compound in a chromatographic system; they depend only on the chemical structure of the compound and on the stationary phase. Retention index prediction is an important task because databases contain experimental values for only a small fraction of all possible molecules, while this information is useful for untargeted analysis. In this work, we consider four machine learning models for retention index prediction: 1D and 2D convolutional neural networks, a deep residual multilayer perceptron, and gradient boosting. The inputs of these four models are, respectively, a string representation of the molecule, a 2D representation of the chemical structure, molecular descriptors and fingerprints, and molecular descriptors, in each case along with information about the stationary phase. The first and third models show the best performance, while the other two perform slightly worse. The models predict retention index values for various standard and semi-standard non-polar stationary phases. Performance was improved further by a linear model that takes the outputs of the four models as inputs (model stacking). The models were tested on several diverse data sets: flavor compounds, essential oils, and metabolomics-related compounds. The median absolute error is 6–40 units and the median percentage error is 0.8–2.2%, depending on the test data set. The stacking model outperforms previously reported approaches on all test data sets. Parameters of a pre-trained model and some source code are provided.
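
As an illustration of the stacking and error metrics described in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes the linear stacking model is fitted by ordinary least squares with an intercept, and that the percentage error is taken relative to the observed retention index; neither detail is stated in the abstract. All function names and the toy numbers are hypothetical, and only two base models are used here for brevity instead of four.

```python
from statistics import median


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no zero-pivot handling in this sketch).
    """
    n = len(b)
    # Augmented matrix, copied so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_stacker(base_preds, targets):
    """Least-squares weights over base-model predictions, intercept last."""
    rows = [list(p) + [1.0] for p in base_preds]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_stacker(weights, base_preds):
    """Apply the fitted linear combination to new base-model predictions."""
    return [sum(w * v for w, v in zip(weights, list(p) + [1.0]))
            for p in base_preds]


def median_errors(predicted, observed):
    """Return (median absolute error, median percentage error)."""
    abs_err = [abs(p - o) for p, o in zip(predicted, observed)]
    # Assumption: percentage error is relative to the observed value.
    pct_err = [100.0 * e / o for e, o in zip(abs_err, observed)]
    return median(abs_err), median(pct_err)


# Hypothetical toy data: predictions of two base models for five compounds.
train_preds = [(800, 820), (1000, 990), (1200, 1230), (1500, 1480), (900, 930)]
# Synthetic targets built from a known linear rule, so the fit can be checked.
train_ri = [0.6 * a + 0.4 * b + 5.0 for a, b in train_preds]

weights = fit_stacker(train_preds, train_ri)
stacked = predict_stacker(weights, train_preds)
mae, mpe = median_errors([1000, 1210, 1490], [1010, 1200, 1500])
```

In practice the stacking weights would be fitted on held-out predictions rather than on the data used to train the base models, so that the linear model does not simply reward the most overfitted base model.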
