Bagging GLM: Improved generalized linear model for the analysis of zero-inflated data

Species-occurrence data sets tend to contain a large proportion of zero values, i.e., absence values (zero-inflated). Statistical inference using such data sets is likely to be inefficient or lead to incorrect conclusions unless the data are treated carefully. In this study, we propose a new modeling method that combines a regression model with a machine-learning technique to overcome the problems caused by zero-inflated data sets. Specifically, we combined a generalized linear model (GLM), which is widely used in ecology, with bootstrap aggregation (bagging). Using both our new method and the traditional GLM, we built distribution models of Vincetoxicum pycnostelma (a vascular plant) and Ninox scutulata (an owl), both of which are endangered and have zero-inflated distribution patterns, and compared model performance. We also modeled four theoretical data sets containing different ratios of presence/absence values with the new and traditional methods and compared model performance. For the distribution models, our new method performed well relative to traditional GLMs: after bagging, area under the curve (AUC) values were almost the same as with the traditional method, but sensitivity values were higher. Our new method also showed higher sensitivity than the traditional GLM when modeling a theoretical data set containing a large proportion of zero values. These results indicate that our new method predicts presence data well when analyzing zero-inflated data sets. Generally, predicting presence data is more difficult than predicting absence data. Our new modeling method has potential for advancing species distribution modeling.
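The combination of a GLM with bagging described above can be illustrated with a minimal sketch. The abstract does not specify the resampling scheme, the number of bootstrap replicates, or how the replicate models are combined, so everything below is an assumption: plain bootstrap resampling with replacement, a binomial GLM with a logit link fitted by Newton's method, and averaging of predicted probabilities. All function names (`fit_logistic_glm`, `bagging_glm`, `bagged_predict`, `sensitivity`) and the synthetic data are hypothetical and are not taken from the study.

```python
import numpy as np

def fit_logistic_glm(X, y, n_iter=25, ridge=1e-6):
    """Fit a binomial GLM (logit link) with Newton's method (equivalent to IRLS)."""
    Xd = np.column_stack([np.ones(len(X)), X])       # prepend an intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        eta = np.clip(Xd @ beta, -30, 30)            # clip to avoid overflow in exp
        mu = 1.0 / (1.0 + np.exp(-eta))              # fitted presence probabilities
        w = mu * (1.0 - mu)                          # binomial variance weights
        # Tiny ridge term (an assumption) keeps the Hessian invertible
        H = Xd.T @ (Xd * w[:, None]) + ridge * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(H, Xd.T @ (y - mu))
    return beta

def predict_proba(beta, X):
    """Predicted presence probability for each row of X."""
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))

def bagging_glm(X, y, n_models=50, seed=0):
    """Fit one GLM per bootstrap resample; returns an (n_models, p + 1) coefficient array."""
    rng = np.random.default_rng(seed)
    n = len(y)
    betas = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)             # sample rows with replacement
        betas.append(fit_logistic_glm(X[idx], y[idx]))
    return np.array(betas)

def bagged_predict(betas, X):
    """Aggregate by averaging predicted probabilities over all replicate models."""
    return np.mean([predict_proba(b, X) for b in betas], axis=0)

def sensitivity(y_true, p_hat, threshold=0.5):
    """Fraction of true presences predicted as presences at the given threshold."""
    pred = p_hat >= threshold
    return np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)

# Synthetic zero-inflated presence/absence data (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 1))
y = (rng.random(400) < 1.0 / (1.0 + np.exp(3.0 - 2.0 * X[:, 0]))).astype(float)

betas = bagging_glm(X, y, n_models=50)
p_bagged = bagged_predict(betas, X)
p_single = predict_proba(fit_logistic_glm(X, y), X)
print("presence rate:", y.mean())
print("sensitivity (single GLM):", sensitivity(y, p_single))
print("sensitivity (bagged GLM):", sensitivity(y, p_bagged))
```

The sketch only shows the mechanics of fitting and aggregating replicate GLMs; it is not expected to reproduce the sensitivity gains reported in the abstract, which may depend on details of the resampling and thresholding that are not given here.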
