Sample size matters: investigating the effect of sample size on a logistic regression susceptibility model for debris flows

Abstract. Predictive spatial modelling is an important task in natural hazard assessment and in the regionalisation of geomorphic processes or landforms. Logistic regression is a multivariate statistical approach frequently used in predictive modelling; it can be conducted stepwise to select, from a number of candidate independent variables, those that lead to the best model. In our case study on a debris flow susceptibility model, we investigate how sensitive model selection and model quality are to sample size, in light of the following problem: on the one hand, a sample has to be large enough to cover the variability of geofactors within the study area and to yield stable and reproducible results; on the other hand, the sample must not be too large, because a large sample is likely to violate the assumption of independent observations due to spatial autocorrelation. Using stepwise model selection with 1000 random samples for a number of sample sizes between n = 50 and n = 5000, we investigate the inclusion and exclusion of geofactors and the diversity of the resulting models as a function of sample size; the multiplicity of different models is assessed using numerical indices borrowed from information theory and biodiversity research. Model diversity decreases with increasing sample size and reaches either a local minimum or a plateau; still larger samples do not reduce it further, and they approach the upper limit of sample size given, in this study, by the autocorrelation range of the spatial data sets. In this way, an optimised sample size can be derived from an exploratory analysis.
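As a rough illustration of how the multiplicity of selected models might be quantified, the sketch below treats each distinct set of selected geofactors as a "species" and counts how often it occurs across the random samples. The abstract does not name the indices used; the Shannon entropy and the Gini-Simpson index are assumed here as plausible examples, and the function name `model_diversity` and the variable names in the usage note are hypothetical.

```python
import math
from collections import Counter


def model_diversity(selected_models):
    """Summarise the diversity of models chosen by stepwise selection.

    selected_models: one iterable of variable names per random sample.
    Two models count as identical if they contain the same variables,
    regardless of order. The choice of indices is an assumption.
    """
    # Count how often each distinct variable set was selected.
    counts = Counter(frozenset(model) for model in selected_models)
    total = sum(counts.values())
    shares = [c / total for c in counts.values()]

    # Shannon entropy (natural log): 0 when every sample yields the same model.
    shannon = -sum(p * math.log(p) for p in shares)
    # Gini-Simpson index: probability that two randomly drawn models differ.
    simpson = 1.0 - sum(p * p for p in shares)

    return {"n_models": len(counts), "shannon": shannon, "simpson": simpson}
```

Calling `model_diversity` once per sample size (for instance on the 1000 variable sets obtained for n = 50, then for n = 100, and so on) and plotting the indices against n would show where diversity stops decreasing.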
The model uncertainty that arises from sampling and model selection, and the predictive ability of the resulting models, are explored statistically and spatially using 100 models estimated in one study area and validated in a neighbouring area: depending on the study area and on sample size, the predicted probabilities for debris flow release differed, on average, by 7 to 23 percentage points. In view of these results, we argue that researchers applying model selection should explore how the selection behaves for different sample sizes, and that consensus models created from a number of random samples should be preferred over models relying on a single sample.
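A minimal sketch of the consensus idea, assuming that each fitted logistic regression model is stored as a dictionary of coefficients and that the consensus is the plain mean of the predicted probabilities. The abstract does not state how the consensus or the spread between models was computed, so the averaging rule, the range as a measure of disagreement, and all names (`predict`, `consensus`, `"intercept"`) are assumptions for illustration only.

```python
import math
from statistics import mean


def predict(coefs, x):
    """Predicted probability from one logistic regression model.

    coefs: dict mapping variable name -> coefficient, with the key
           "intercept" for the constant term (hypothetical layout).
    x:     dict mapping variable name -> geofactor value at one location.
    """
    z = coefs.get("intercept", 0.0)
    z += sum(b * x[name] for name, b in coefs.items() if name != "intercept")
    return 1.0 / (1.0 + math.exp(-z))


def consensus(models, x):
    """Combine several models at one location.

    Returns the mean predicted probability (assumed consensus rule) and
    the range between the highest and lowest prediction as a simple
    indicator of how much the models disagree.
    """
    probs = [predict(m, x) for m in models]
    return {"mean": mean(probs), "range": max(probs) - min(probs)}
```

Applied to every raster cell, the `range` value gives a map of where models built from different random samples disagree most, and the `mean` value gives the consensus susceptibility.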
