The k-nearest neighbor technique with local linear regression

In a standard k-nearest neighbor (kNN) technique, unit-level values of the variables of interest (Y) are imputed from the k nearest neighbors in a set of reference units. "Nearest" is defined with respect to a distance metric in the space of auxiliary variables (X). This study evaluates kNN imputations of Y in which the same distance metric is used to select the k nearest locally weighted regression models. Imputations are obtained as predictions using the X values of the k nearest neighbors in the population. In simulated random sampling from three artificial multivariate populations and two actual univariate populations, with sampling units composed of either a single population element or a cluster of four elements, the new kNN technique: (1) improved the correlation between an imputation and its actual value; (2) lowered the root mean square error (RMSE) of imputations; (3) increased the slope in regressions of actual y values on their imputed values; (4) performed relatively best with a k value of 4 and sample sizes of 200 or greater; (5) compared favorably with a recently proposed kNN calibration procedure; and (6) had a 15–28% higher RMSE than a simple local linear regression. Distribution matching consistently worsened RMSE (+10%).
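The abstract does not spell out the algorithm, so the following is only a minimal Python sketch of one possible reading, not the authors' implementation. It assumes a single auxiliary variable x, absolute difference as the distance metric, and a Gaussian kernel with a fixed bandwidth for the local weights; all of these are illustrative choices. The function names (`knn_impute`, `local_linear_fit`, `knn_llr_impute`) and the `bandwidth` parameter are hypothetical. The sketch contrasts the standard kNN average with an imputation that averages the predictions of the local linear models centered at the k nearest reference units.

```python
import math


def nearest_indices(x_ref, x0, k):
    """Indices of the k reference units closest to x0 (absolute distance)."""
    return sorted(range(len(x_ref)), key=lambda i: abs(x_ref[i] - x0))[:k]


def knn_impute(x_ref, y_ref, x0, k):
    """Standard kNN: average y over the k nearest reference units."""
    nearest = nearest_indices(x_ref, x0, k)
    return sum(y_ref[i] for i in nearest) / k


def local_linear_fit(x_ref, y_ref, center, bandwidth):
    """Kernel-weighted least squares line fitted around `center`.

    Returns (intercept, slope) of y = intercept + slope * (x - center).
    Assumes at least two distinct x values, so the weighted variance is > 0.
    """
    # Gaussian kernel weights (an assumed choice, not taken from the source).
    w = [math.exp(-0.5 * ((x - center) / bandwidth) ** 2) for x in x_ref]
    u = [x - center for x in x_ref]  # x measured relative to the center
    sw = sum(w)
    u_bar = sum(wi * ui for wi, ui in zip(w, u)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y_ref)) / sw
    s_uu = sum(wi * (ui - u_bar) ** 2 for wi, ui in zip(w, u))
    s_uy = sum(wi * (ui - u_bar) * (yi - y_bar)
               for wi, ui, yi in zip(w, u, y_ref))
    slope = s_uy / s_uu
    intercept = y_bar - slope * u_bar
    return intercept, slope


def knn_llr_impute(x_ref, y_ref, x0, k, bandwidth):
    """Average the predictions at x0 of the local linear models
    centered at the k nearest reference units."""
    preds = []
    for i in nearest_indices(x_ref, x0, k):
        intercept, slope = local_linear_fit(x_ref, y_ref, x_ref[i], bandwidth)
        preds.append(intercept + slope * (x0 - x_ref[i]))
    return sum(preds) / k


if __name__ == "__main__":
    # Toy reference set lying exactly on a line (illustrative data only).
    x_ref = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    y_ref = [1.0 + 2.0 * x for x in x_ref]
    print(knn_impute(x_ref, y_ref, 2.4, 3))
    print(knn_llr_impute(x_ref, y_ref, 2.4, 3, 1.5))
```

On data that lie exactly on a line, the local linear version reproduces the line at the target x, whereas the plain kNN average is pulled toward the mean of the neighbors' y values. Real data with noise and curvature will not behave this cleanly.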
