Testing the reliability and stability of the internal accuracy assessment of random forest for classifying tree defoliation levels using different validation methods

In this study, we evaluated the strength and reliability of the internal accuracy estimate built into the random forest (RF) ensemble classifier. Specifically, we compared the reliability of RF's internal validation with that of independent test data sets produced by different splitting options for defoliation classification. We also set out to identify, on statistical grounds, the best independent split option for image classification using RF and multispectral RapidEye imagery. The results show that the internal accuracy measure yields results comparable to those derived from an independent test data set. More importantly, the errors produced by RF's internal validation were relatively stable, as shown statistically by a narrower confidence interval than that obtained with the independent test data. The results also show that the 70–30% split option had the lowest mean standard error (0.2351), and hence the highest accuracy, among the split options compared. The study confirms the reliability and stability of the internal bootstrap estimate of accuracy built into the random forest algorithm.
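The comparison between an internal (out-of-bag) accuracy estimate and an independent hold-out estimate can be illustrated with a minimal sketch. It is not the study's method or data: bagged decision stumps stand in for a full random forest, the two-class data are synthetic, and the train fractions other than 0.7, the helper names, the tree count, and the seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def make_data(n, rng):
    """Hypothetical two-class data: one feature, class means 0 and 2, unit variance."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def fit_stump(sample):
    """Choose the threshold and polarity with the fewest errors on the sample."""
    best = None
    for thr, _ in sample:
        # Errors when predicting class 1 for x > thr.
        errors = sum((x > thr) != bool(y) for x, y in sample)
        # Polarity 0 flips the rule, so its error count is the complement.
        for polarity, err in ((1, errors), (0, len(sample) - errors)):
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best[1], best[2]


def predict_stump(stump, x):
    thr, polarity = stump
    return int(x > thr) if polarity == 1 else int(x <= thr)


def bagged_oob(train, n_trees, rng):
    """Fit bagged stumps; each sample is scored only by stumps that never saw it."""
    n = len(train)
    stumps = []
    oob_votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        stump = fit_stump([train[i] for i in idx])
        stumps.append(stump)
        for i in range(n):
            if i not in in_bag:
                oob_votes[i][predict_stump(stump, train[i][0])] += 1
    # Skip any sample that happened to be in every bootstrap sample.
    scored = [(v.most_common(1)[0][0], train[i][1])
              for i, v in enumerate(oob_votes) if v]
    oob_acc = sum(p == y for p, y in scored) / len(scored)
    return stumps, oob_acc


def holdout_accuracy(stumps, test):
    """Majority-vote accuracy of the ensemble on independent test samples."""
    correct = 0
    for x, y in test:
        votes = Counter(predict_stump(s, x) for s in stumps)
        correct += votes.most_common(1)[0][0] == y
    return correct / len(test)


def compare(train_fraction, n=300, n_trees=25, seed=0):
    """Return (out-of-bag accuracy, hold-out accuracy) for one split option."""
    rng = random.Random(seed)
    data = make_data(n, rng)
    rng.shuffle(data)
    cut = int(round(train_fraction * n))
    train, test = data[:cut], data[cut:]
    stumps, oob_acc = bagged_oob(train, n_trees, rng)
    return oob_acc, holdout_accuracy(stumps, test)


if __name__ == "__main__":
    for frac in (0.5, 0.7, 0.9):  # only 70-30 is named in the text
        oob, hold = compare(frac)
        print(f"train fraction {frac:.1f}: OOB={oob:.3f}  hold-out={hold:.3f}")
```

Repeating `compare` over many seeds and summarising the spread of each estimate would mirror the kind of stability comparison described above, though the exact statistics used in the study are not specified here.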
