Data prevalence matters when assessing species' responses using data-driven species distribution models

Abstract: The study of species' responses is key to understanding the ecology of a species (e.g. critical habitat requirements and biological invasion processes) and to designing better conservation and management plans (e.g. problem identification, priority assessment and risk analysis). Predictive machine learning methods can be used both to model species distributions and to describe the important variables and specific habitat conditions required by a target species. This study aims (1) to demonstrate how habitat information such as species response curves can be retrieved from a species distribution model (SDM), (2) to assess the effects of data prevalence on model accuracy and on the habitat information retrieved from SDMs, and (3) to illustrate the differences between three data-driven methods, namely a fuzzy habitat suitability model (FHSM), random forests (RF) and support vector machines (SVMs). Nineteen sets of virtual species data with different data prevalences were generated using field-observed habitat conditions and hypothetical habitat suitability curves under four interaction scenarios governing the species–environment relationship for a virtual species. The effects of data prevalence on species distribution modeling were evaluated in terms of model accuracy and habitat information such as species response curves. Data prevalence affected both model accuracy and the assessment of species' responses, with a stronger influence on the latter. The effects of data prevalence on model accuracy were less pronounced for RF and SVMs, which showed higher performance. While the response curves were similar among the three models, data prevalence markedly affected their shapes. Specifically, response curves obtained from data sets with higher prevalence showed higher tolerance to unsuitable habitat conditions, which emphasizes the importance of accounting for data prevalence when assessing species–environment relationships. In a practical implementation of an SDM, data prevalence should therefore be taken into account when interpreting the model results.
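
The prevalence effect described in the abstract can be illustrated with a minimal sketch. It is not the study's procedure: it uses a single synthetic habitat variable instead of field-observed conditions, a hypothetical Gaussian suitability curve, and a simple binned presence fraction in place of the response curves retrieved from FHSM, RF or SVMs. The function names, the 19 prevalence levels (0.05 to 0.95) and the sample sizes are all assumptions made for illustration.

```python
import numpy as np


def suitability(x, optimum=0.5, width=0.15):
    """Hypothetical bell-shaped habitat suitability curve (assumed shape)."""
    return np.exp(-0.5 * ((x - optimum) / width) ** 2)


def resample_to_prevalence(x, y, prevalence, n_total, rng):
    """Draw n_total records (with replacement) so that the fraction of
    presences equals the requested prevalence."""
    n_pres = int(round(prevalence * n_total))
    n_abs = n_total - n_pres
    pres_idx = rng.choice(np.flatnonzero(y == 1), size=n_pres, replace=True)
    abs_idx = rng.choice(np.flatnonzero(y == 0), size=n_abs, replace=True)
    idx = np.concatenate([pres_idx, abs_idx])
    return x[idx], y[idx]


def empirical_response_curve(x, y, bins):
    """Fraction of presences per bin of the habitat variable.
    A crude stand-in for a model-derived response curve."""
    which = np.digitize(x, bins[1:-1])  # bin index 0 .. len(bins) - 2
    return np.array([
        y[which == b].mean() if np.any(which == b) else np.nan
        for b in range(len(bins) - 1)
    ])


rng = np.random.default_rng(0)

# Virtual species: presence is drawn with probability equal to suitability.
x = rng.uniform(0.0, 1.0, 20000)  # synthetic habitat variable on [0, 1)
y = (rng.uniform(0.0, 1.0, x.size) < suitability(x)).astype(int)

bins = np.linspace(0.0, 1.0, 11)           # ten equal-width bins
prevalences = np.linspace(0.05, 0.95, 19)  # assumed 19 prevalence levels

curves = {}
for p in prevalences:
    xs, ys = resample_to_prevalence(x, y, p, 2000, rng)
    curves[round(float(p), 2)] = empirical_response_curve(xs, ys, bins)

# Compare the curve in a poorly suitable bin (x in [0.1, 0.2)).
print("prevalence 0.05:", curves[0.05][1])
print("prevalence 0.95:", curves[0.95][1])
```

In this toy setting the presence fraction in the poorly suitable bin rises with prevalence, so a curve read from a high-prevalence data set suggests more tolerance to unsuitable conditions even though the underlying suitability curve is unchanged.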
