Creation of a training dataset for Sentinel-2 land cover classification

Supervised classification of satellite images relies on reference training data. The availability and quality of the reference data therefore strongly influence both the results and the course of the entire classification process. In the Sentinel-2 Global Land Cover (S2GLC) project, Sentinel-2 images are classified using the Random Forest (RF) algorithm, trained on points selected from existing low-resolution land cover databases. This approach allows the classification to be performed in a highly automated manner, with little operator intervention. An alternative method for creating the training dataset has been developed so that the S2GLC classification can still be carried out when access to the required land cover databases is limited or their quality is low. The proposed method is a semi-automatic process initiated by an operator who, through visual interpretation, indicates only a few starting samples for the classes of interest. Using this limited set of initial training samples, hundreds or thousands of training samples with similar spectral characteristics are then selected automatically from the image. The resulting dataset can be used as an alternative source of training data for land cover classification on a much larger scale. Compared with the traditional approach, in which all samples or training areas are indicated manually, the developed method is highly efficient and allows data to be processed more rapidly. The semi-automatically created training dataset can be used as an alternative or a supplement to the training dataset applied in the S2GLC classification approach.
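The sample-expansion step can be illustrated with a minimal sketch. The abstract does not specify how spectral similarity is measured, so the sketch assumes Euclidean distance to the mean spectrum of the operator's seed pixels; the function name `expand_training_samples` and the parameters `max_distance` and `max_per_class` are hypothetical and are not taken from the S2GLC implementation.

```python
import numpy as np


def expand_training_samples(image, seeds, max_distance=0.05, max_per_class=1000):
    """Grow a few operator-chosen seed pixels into a larger training set.

    image:         float array of shape (rows, cols, bands)
    seeds:         dict mapping class label -> list of (row, col) seed pixels
    max_distance:  assumed similarity threshold (Euclidean, in image units)
    max_per_class: upper limit on the number of samples returned per class
    Returns a dict mapping class label -> array of (row, col) positions.
    """
    rows, cols, bands = image.shape
    pixels = image.reshape(-1, bands).astype(float)

    labels = list(seeds)
    distances = np.empty((len(labels), pixels.shape[0]))
    for i, label in enumerate(labels):
        rc = np.asarray(seeds[label])
        # Reference spectrum: mean of the seed pixels indicated by the operator.
        reference = image[rc[:, 0], rc[:, 1]].mean(axis=0)
        distances[i] = np.linalg.norm(pixels - reference, axis=1)

    # Each pixel may only be assigned to its spectrally closest class.
    nearest = distances.argmin(axis=0)
    best = distances.min(axis=0)

    selected = {}
    for i, label in enumerate(labels):
        idx = np.flatnonzero((nearest == i) & (best <= max_distance))
        # Keep the most similar pixels first, up to the per-class limit.
        idx = idx[np.argsort(best[idx])][:max_per_class]
        selected[label] = np.column_stack(np.unravel_index(idx, (rows, cols)))
    return selected
```

The returned pixel positions, together with their class labels, could then serve as input to a Random Forest classifier. A suitable threshold depends on the scaling of the image values and would need to be tuned for each scene.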