On the Importance of Training Data Sample Selection in Random Forest Image Classification: A Case Study in Peatland Ecosystem Mapping

Random Forest (RF) is a widely used algorithm for classifying remotely sensed data. Through a case study of peatland classification using LiDAR derivatives, we analyze how input data characteristics affect RF classifications, measured by RF out-of-bag error, independent classification accuracy, and class proportion error. Training data selection and the specific input variables (i.e., image channels) have a large impact on the overall accuracy of the image classification. High-dimensional datasets should be reduced so that only important, uncorrelated variables are used in classification. Although RF is an ensemble approach, independent error assessments should be used to evaluate RF results, and iterative classifications are recommended to assess the stability of predicted classes. Results are also highly sensitive to the size of the training data set. In addition to being as large as possible, training data sets used in RF classification should (a) be randomly distributed, or created in a manner that makes the class proportions of the training data representative of actual class proportions in the landscape; and (b) have minimal spatial autocorrelation, both to improve classification results and to mitigate inflated estimates of RF out-of-bag classification accuracy.
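The two training-data recommendations above can be illustrated with a minimal sketch. It is not the procedure used in the study: the function names, the class labels (bog, fen, upland), the landscape proportions, and the minimum-distance threshold are all hypothetical, and greedy distance thinning is only one simple way to reduce spatial autocorrelation among training points.

```python
import math
import random
from collections import Counter


def class_proportions(labels):
    """Return {class: fraction of samples} for a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def proportion_error(train_labels, landscape_proportions):
    """Per-class difference: training proportion minus landscape proportion.

    Classes absent from the training data are treated as proportion 0.
    """
    train = class_proportions(train_labels)
    return {
        cls: train.get(cls, 0.0) - p
        for cls, p in landscape_proportions.items()
    }


def thin_by_distance(points, min_dist, seed=0):
    """Greedily keep points that are at least min_dist apart.

    points is a list of (x, y, label) tuples. The threshold min_dist is
    an assumption the analyst must choose; it is not given in the text.
    """
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)  # random visiting order avoids order bias
    kept = []
    for x, y, label in shuffled:
        far_enough = all(
            math.hypot(x - kx, y - ky) >= min_dist for kx, ky, _ in kept
        )
        if far_enough:
            kept.append((x, y, label))
    return kept


# Hypothetical landscape proportions and an unbalanced training set.
landscape = {"bog": 0.5, "fen": 0.3, "upland": 0.2}
train_labels = ["bog"] * 20 + ["fen"] * 20 + ["upland"] * 60
errors = proportion_error(train_labels, landscape)

# Hypothetical training points on a 100 x 100 unit grid, thinned so that
# no two retained points are closer than 10 units.
rng = random.Random(42)
points = [
    (rng.uniform(0, 100), rng.uniform(0, 100), rng.choice(list(landscape)))
    for _ in range(200)
]
thinned = thin_by_distance(points, min_dist=10.0)
```

Here a positive entry in `errors` marks a class that is over-represented in the training data relative to the landscape, and a negative entry marks an under-represented class. Thinning discards points, so it trades against the recommendation to keep the training set as large as possible.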
