Mining Concise Datasets for Testing Satellite-Data-Based Land-Cover Classifiers Meant for Large Geographic Areas

Obtaining an accurate estimate of a land-cover classifier's performance over a wide geographic area is challenging because the ground truth must represent the entire area, which may be thousands of square kilometers in size. The current best approach constructs a test set by drawing samples randomly from the entire area, with a human supplying the true label for each sample, in the hope that the labeled data thus collected statistically capture all of the data diversity in the area. A major shortcoming of this approach is that, in an interactive session, it is difficult for a human to ensure that the next sample chosen by the random sampler provides information that is nonredundant with respect to the data already collected. To reduce the annotation burden caused by this uncertainty, it makes sense to remove redundancies from the entire dataset before presenting its samples to the human for annotation. This article presents a framework that uses a combination of clustering and compression to create a concise-set representation of the land-cover data for a large geographic area. Clustering is achieved by applying locality-sensitive hashing to the data elements, and compression is achieved by choosing a single data element to represent each cluster. This framework reduces the annotation burden on the human and makes it more likely that the human will persevere through the annotation stage. We validate our framework experimentally by comparing it with the traditional random sampling approach using WorldView2 satellite imagery.
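
The cluster-then-compress idea described above can be illustrated with a minimal sketch. It is not the article's implementation: the abstract does not say which hash family or which representative-selection rule is used, so the sketch assumes random-hyperplane (sign) hashing and simply keeps the first element of each bucket. The function names, the `num_bits` and `seed` parameters, and the toy feature vectors are all hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vec, hyperplanes):
    """Hash a vector to a bit tuple: one sign bit per hyperplane."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0.0 else 0
        for plane in hyperplanes
    )


def build_concise_set(samples, num_bits=8, seed=0):
    """Cluster samples by LSH key, then keep one representative per cluster.

    Returns (representatives, buckets): the sorted indices of the kept
    samples, and a dict mapping each hash key to the indices sharing it.
    """
    planes = make_hyperplanes(num_bits, len(samples[0]), seed)
    buckets = defaultdict(list)
    for idx, vec in enumerate(samples):
        buckets[lsh_key(vec, planes)].append(idx)
    # Assumption: the first member of a bucket stands in for the whole bucket.
    representatives = sorted(members[0] for members in buckets.values())
    return representatives, dict(buckets)


if __name__ == "__main__":
    # Hypothetical per-sample feature vectors (e.g., band statistics).
    features = [
        [0.12, 0.40, 0.33],
        [0.24, 0.80, 0.66],   # same direction as the first vector
        [-0.50, 0.10, -0.90],
    ]
    reps, buckets = build_concise_set(features)
    print("samples kept for annotation:", reps)
    print("number of buckets:", len(buckets))
```

Only the samples listed in `reps` would be shown to the annotator. Raising `num_bits` gives finer buckets and therefore more representatives to label, while lowering it merges more samples into each bucket.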
