Instance selection and classification tree analysis for large spatial datasets in digital soil mapping

Abstract Digital soil mapping is currently experiencing a tremendous increase in the number and resolution of environmental covariates available for spatial soil prediction, which causes computational problems because machine learning approaches have limited data handling capabilities. This is of particular importance when gridded spatial soil class maps, which contain large numbers of redundant instances and noisy information, are used as a basis for predictions. In this study we systematically analyze the effect of instance selection, which aims to reduce sample size while preserving or even increasing prediction accuracy. On a soil class dataset with 95,000 instances we tested two sampling approaches, proportional and disproportional stratified random sampling, in relation to the parameter settings of decision tree based learning. An automated grid search was used to find the best performing parameter settings of the decision tree. The results show that an appropriate sampling method combined with grid search returns better results than grid search applied without instance selection. Instance selection increases prediction accuracy especially where the frequency of a soil class is low compared to the surrounding area. However, instance selection does not help in pedological interpretation. Nevertheless, it is a valuable pre-processing method for handling large, high-resolution spatial datasets in digital soil class prediction, in terms of both accuracy and computational cost. Based on the results of this study, we suggest that future pedometric research investigate spatially constrained instance selection as well as boundary-based digital soil mapping in terms of soil taxonomic contrast.
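The two sampling approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `stratified_sample`, its parameters, and the toy labels are hypothetical, and "disproportional" is assumed here to mean equal allocation per soil class, which the abstract does not specify.

```python
import random
from collections import defaultdict


def stratified_sample(labels, n_total, proportional=True, seed=0):
    """Return sorted indices of a stratified random sample, one stratum per class.

    proportional=True : each class keeps its share of the full dataset.
    proportional=False: equal allocation per class (an assumed reading of
                        "disproportional"; other allocations are possible).
    """
    rng = random.Random(seed)

    # Group instance indices by class label (the strata).
    strata = defaultdict(list)
    for idx, label in enumerate(labels):
        strata[label].append(idx)

    selected = []
    for label in sorted(strata):
        members = strata[label]
        if proportional:
            quota = round(n_total * len(members) / len(labels))
        else:
            quota = n_total // len(strata)
        # Keep at least one instance per class, never more than available.
        quota = min(len(members), max(1, quota))
        selected.extend(rng.sample(members, quota))
    return sorted(selected)


# Toy, strongly imbalanced class map: 900 / 90 / 10 instances.
labels = ["A"] * 900 + ["B"] * 90 + ["C"] * 10
prop_idx = stratified_sample(labels, n_total=100, proportional=True)
disp_idx = stratified_sample(labels, n_total=100, proportional=False)
```

With these toy labels, proportional sampling keeps the 90/9/1 class ratio, whereas equal allocation draws 33 instances from each of the two larger classes and all 10 instances of the rare class, so rare classes carry far more weight in the reduced training set.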
