Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Linear regression is often used as a diagnostic tool to understand the relative contributions of operational variables to a key performance indicator or response variable. However, owing to the nature of plant operations, predictor variables tend to be correlated, often highly so, which can seriously complicate the assessment of their importance. Shapley regression is regarded as the only axiomatic approach to this problem, but to date it has been used almost exclusively with linear models. In this paper, the approach is extended to random forests, and the results are compared with the empirical variable importance measures widely used with these models, namely the permutation and Gini variable importance measures. Four case studies are considered, two based on simulated data and two on real-world data from the mineral processing industries. These case studies suggest that the random forest Shapley variable importance measure may be a more reliable indicator of the influence of predictor variables than the other measures considered. Moreover, the results obtained with the Gini variable importance measure were as reliable as, or more reliable than, those obtained with the permutation measure.
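To make the approach concrete, the sketch below illustrates one way a Shapley decomposition of goodness of fit can be computed with random forests in place of linear models. It assumes out-of-bag R² as the fit measure and scikit-learn's RandomForestRegressor; the function names (rf_r2, shapley_rf_importance) and all settings are illustrative and are not taken from the paper.

```python
# Minimal sketch: Shapley-value decomposition of goodness of fit (R^2)
# with random forests as the underlying model. Out-of-bag R^2 is assumed
# as the value function; other fit measures (e.g., cross-validated R^2)
# could be substituted.

from itertools import combinations
from math import factorial
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_r2(X, y, features, random_state=0):
    """Out-of-bag R^2 of a random forest fitted on a subset of features."""
    if not features:
        return 0.0  # an empty model explains no variance
    model = RandomForestRegressor(
        n_estimators=200, oob_score=True, random_state=random_state
    )
    model.fit(X[:, list(features)], y)
    return max(model.oob_score_, 0.0)  # clamp noisy negative OOB scores

def shapley_rf_importance(X, y):
    """Decompose the overall R^2 over predictors via Shapley values.

    Exhaustive over all 2^p feature subsets, so only feasible for small p;
    sampling over permutations would be needed for larger problems, and the
    subset fits are recomputed here rather than cached, for clarity.
    """
    p = X.shape[1]
    phi = np.zeros(p)
    others = lambda j: [k for k in range(p) if k != j]
    for j in range(p):
        for size in range(p):
            for S in combinations(others(j), size):
                # Shapley weight |S|! (p - |S| - 1)! / p!
                w = factorial(size) * factorial(p - size - 1) / factorial(p)
                gain = rf_r2(X, y, S + (j,)) - rf_r2(X, y, S)
                phi[j] += w * gain
    return phi  # sums (approximately) to the full-model R^2

# Example with simulated, correlated predictors:
# rng = np.random.default_rng(1)
# X = rng.normal(size=(500, 4)); X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)
# y = X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=500)
# print(shapley_rf_importance(X, y))
```

By contrast, the permutation and Gini (impurity-based) importance measures compared in the paper are obtained from a single fitted forest; in scikit-learn, for example, the impurity-based measure is available directly as feature_importances_. The Shapley decomposition trades this convenience for an exponential number of subset fits.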
