Use of a tree‐structured hierarchical model for estimation of location and uncertainty in multivariate spatial data

Analysis and modeling of spatial data are of considerable interest in many applications. However, the prediction of geographical features from chemical measurements on a set of geographically distinct samples has not previously been explored. We report a new, tree‐structured hierarchical model for estimating the geographical location of spatially distributed samples from their chemical measurements. The model stores a set of geographic regions in a hierarchical tree structure, with each nonterminal node representing a classifier and each terminal node representing a regression model. Once the tree‐structured model is constructed, a sample for which only chemical measurements are available is passed through a series of classification steps, each of which further restricts its predicted region. The geographic location of the sample is then predicted using the regression model of the terminal subregion it reaches. We show that the tree‐structured modeling approach provides reasonable estimates of geographical region and geographic location for surface water samples taken across the entire USA. Further, the location uncertainty can also be assessed: it is an estimate of the probability that a test sample is located within a pre‐estimated joint prediction interval that is much smaller than the terminal subregion. Copyright © 2014 John Wiley & Sons, Ltd.
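
The structure described above (classifiers at nonterminal nodes, regression models at terminal nodes) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which classifier or regression method is used, so a nearest-centroid classifier and ordinary least squares stand in here as assumptions. The class names `Split` and `Leaf`, the helper `chem`, the region labels, and all numbers are hypothetical and synthetic. The sketch also assumes every region path has the same depth, and it does not attempt the location-uncertainty estimate.

```python
import numpy as np


class Leaf:
    """Terminal node: least-squares regression from chemistry to (lat, lon)."""

    def __init__(self, X, Y):
        A = np.hstack([X, np.ones((len(X), 1))])  # append an intercept column
        self.coef, *_ = np.linalg.lstsq(A, Y, rcond=None)

    def predict(self, x):
        return (), np.append(x, 1.0) @ self.coef


class Split:
    """Nonterminal node: nearest-centroid classifier over its child regions."""

    def __init__(self, X, Y, paths, depth=0):
        self.children, self.centroids = {}, {}
        for name in sorted({p[depth] for p in paths}):
            idx = [i for i, p in enumerate(paths) if p[depth] == name]
            sub_paths = [paths[i] for i in idx]
            self.centroids[name] = X[idx].mean(axis=0)
            # Recurse while a deeper region label exists, otherwise fit a leaf.
            if len(sub_paths[0]) > depth + 1:
                self.children[name] = Split(X[idx], Y[idx], sub_paths, depth + 1)
            else:
                self.children[name] = Leaf(X[idx], Y[idx])

    def predict(self, x):
        # Classification step: restrict the sample to the closest child region.
        name = min(self.centroids,
                   key=lambda n: np.linalg.norm(x - self.centroids[n]))
        path, loc = self.children[name].predict(x)
        return (name,) + path, loc


def chem(lat, lon):
    """Hypothetical chemistry: three measurements that vary linearly with location."""
    return [lat / 10, lon / 10, (lat + lon) / 20]


# Synthetic training set: a 3 x 3 grid of sites around each made-up region center.
centers = {
    ("west", "northwest"): (45.0, -120.0),
    ("west", "southwest"): (35.0, -115.0),
    ("east", "northeast"): (42.0, -75.0),
    ("east", "southeast"): (32.0, -85.0),
}
X, Y, paths = [], [], []
for path, (lat0, lon0) in centers.items():
    for dlat in (-2, 0, 2):
        for dlon in (-2, 0, 2):
            X.append(chem(lat0 + dlat, lon0 + dlon))
            Y.append([lat0 + dlat, lon0 + dlon])
            paths.append(path)

tree = Split(np.array(X), np.array(Y), paths)

# A test sample carries chemical measurements only; its location is predicted.
region, location = tree.predict(np.array(chem(44.0, -119.0)))
print(region, np.round(location, 3))
```

Each call to `predict` narrows the candidate region by one level before the terminal regression produces coordinates, which mirrors the gradual restriction described in the abstract. With real data the stand-in models would presumably be replaced by whatever classifier and regression method suit the measurements.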
