Use of Random Forest in the identification of important variables

Abstract: The Random Forest (RF) technique has shown promise for supervised classification across different matrices. However, approaches for identifying the variables that carry the most weight in a classification model remain scarce. In this paper, we propose a methodology for selecting the most relevant variables in the construction of RF models. To apply this methodology, classification models were developed to discriminate crude oil samples according to their maximum pour point (MPP). To this end, MPP data (ASTM D5853) were acquired for 105 crude oil samples, together with their hydrogen (1H) and carbon (13C) NMR spectra. With MPP ranging from −54 °C to 39 °C, two classes were assigned: the first containing 43 samples with MPP ≤ −9 °C, and the second containing 62 samples with MPP > −9 °C. The 1H NMR model, with 90% accuracy, and the 13C NMR model, with 71% accuracy, were used in the variable selection method. The results showed that the proposed methodology was effective in distinguishing the variables that contributed most to the discrimination of the oils. This new tool therefore enabled a greater understanding of the chemical information of interest contained in the spectra and its relationship with the MPP of the crude oil samples.
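As a rough illustration of the workflow summarized in the abstract, the sketch below assigns two classes with the −9 °C MPP threshold, fits an RF classifier, and ranks variables by importance. It is not the selection procedure proposed in the paper, whose details are not given here: it assumes scikit-learn's `RandomForestClassifier` and its generic impurity-based `feature_importances_`, and it uses synthetic data in place of the NMR spectra. The helper names `label_by_threshold` and `rank_variables`, the number of trees, and the choice of variable 7 as the informative one are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def label_by_threshold(mpp_values, threshold=-9.0):
    """Return 0 for MPP <= threshold and 1 for MPP > threshold."""
    return (np.asarray(mpp_values, dtype=float) > threshold).astype(int)


def rank_variables(X, y, n_trees=200, seed=0):
    """Fit an RF classifier and return it with variable indices sorted
    from most to least important (impurity-based importance)."""
    model = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=seed
    )
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    return model, order


# Synthetic stand-in for spectra: 105 samples, 50 variables (assumed sizes).
rng = np.random.default_rng(0)
n_samples, n_vars = 105, 50
X = rng.normal(size=(n_samples, n_vars))

# Assumption for illustration only: variable 7 drives the simulated MPP.
mpp = -9.0 + 20.0 * X[:, 7] + rng.normal(scale=2.0, size=n_samples)
y = label_by_threshold(mpp)

model, order = rank_variables(X, y)
print("Most important variables:", order[:5])
print("Out-of-bag accuracy:", round(model.oob_score_, 2))
```

With real spectra, `X` would hold one row per sample and one column per spectral variable, and the top-ranked indices would be mapped back to chemical shifts for interpretation. Impurity-based importances can be biased toward certain kinds of variables, so a permutation-based ranking may be a useful cross-check.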
