Interpretation of nonlinear relationships between process variables by use of random forests

Abstract: A better understanding of process phenomena depends on interpreting models that capture the relationships between process variables. Although linear regression is used routinely for this purpose in the mineral process industries, it may not be useful where the relationships between variables are nonlinear or complex. Under these circumstances, nonlinear methods such as neural networks or decision trees can be used to develop reliable models, but these models do not necessarily give explicit insight into the relationships between the process variables and the target variables. This is a major drawback in situations where such information is very important, such as fault identification or gaining a better understanding of the fundamentals of a process. In this paper, the use of variable importance measures and partial dependency plots generated by random forest models is proposed as a practical tool for overcoming this problem. In particular, it is shown that important variables can be flagged by an appropriate threshold generated by including dummy variables in the system. Moreover, the results of the study indicate that random forest models can reliably identify the influence of individual variables, even in the presence of high levels of additive noise. This would make them a useful tool in continuous process improvement and root cause analysis of abnormal process behaviour.
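The idea of flagging important variables against a dummy-variable threshold, and then inspecting a partial dependence curve, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data, the choice of scikit-learn, the use of permutation importance on held-out data, the rule "threshold = largest dummy importance", and the helper name `partial_dependence_curve`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n, n_real, n_dummy = 500, 3, 5

# Hypothetical process variables: x0 acts quadratically, x1 linearly, x2 not at all.
X_real = rng.uniform(-1.0, 1.0, size=(n, n_real))
y = 4.0 * X_real[:, 0] ** 2 + 2.0 * X_real[:, 1] + rng.normal(0.0, 0.5, size=n)

# Dummy variables: pure noise columns that cannot influence y.
X_dummy = rng.uniform(-1.0, 1.0, size=(n, n_dummy))
X = np.hstack([X_real, X_dummy])

# Simple hold-out split so importances are not measured on the training data.
X_train, y_train = X[:350], y[:350]
X_test, y_test = X[350:], y[350:]

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Permutation importance: drop in score when one column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
importance = perm.importances_mean

# Assumed thresholding rule: a real variable is flagged if its importance
# exceeds the largest importance obtained by any dummy variable.
threshold = importance[n_real:].max()
flagged = importance[:n_real] > threshold


def partial_dependence_curve(model, data, feature, grid):
    """Average prediction with one feature fixed at each grid value in turn."""
    curve = []
    for value in grid:
        modified = data.copy()
        modified[:, feature] = value
        curve.append(model.predict(modified).mean())
    return np.array(curve)


grid = np.linspace(-0.9, 0.9, 7)
pd_x0 = partial_dependence_curve(forest, X_train, 0, grid)

print("importances:", np.round(importance, 3))
print("dummy threshold:", round(float(threshold), 3))
print("flagged real variables:", flagged)
print("partial dependence on x0:", np.round(pd_x0, 2))
```

In this sketch the partial dependence curve for `x0` should dip near zero and rise towards both ends of the grid, reflecting the quadratic term in the assumed data-generating function. Other thresholding rules (for example a high percentile of the dummy importances) would be equally defensible.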
