Classification and regression with random forests as a standard method for presence-only data SDMs: A future conservation example using China tree species

Abstract

The random forests (RF) algorithm is a superb learner and classifier in machine learning applications. This ensemble model is also one of the most widely used algorithms for species distribution models (SDMs). By default, RF can produce categorical and numerical species distribution maps from its classification tree (CT) and regression tree (RT) algorithms, respectively. CT can also produce numerical predictions in the form of class probabilities. Many real-world applications (e.g. conservation planning) require binary presence–absence outputs, which are obtained by applying a classification threshold to the numerical predictions. However, little information is available on how model performance differs between CT and RT when the two are used for inference. Here, under an ensemble modeling framework, 52 forest tree species with presence-only data covering all of China were selected to compare the performance of the CT and RT algorithms in projecting the distribution and potential range shifts of these species under current and future climates. Five climatic variables were used to develop the CT and RT models. Eight threshold-setting approaches were employed to convert numerical predictions into binary predictions. For probabilistic predictions, the relative performance of CT and RT depended on the choice of evaluation criteria. For both RT and CT, the threshold-setting method significantly altered the selected threshold, model performance, and, in turn, the projected species range shifts under climate change. The four threshold selection methods based on composite measures of model accuracy (MaxKappa, MaxOA, MaxTSS, and MinROCdist) most often achieved significantly higher model performance than the CT default threshold method and the other threshold methods. These four methods also projected changes in species' geographical ranges under climate change that were consistent in direction and magnitude.
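The four composite-accuracy threshold criteria named above can be illustrated with a minimal sketch. It assumes the usual confusion-matrix definitions of Cohen's kappa, overall accuracy (OA), the true skill statistic (TSS = sensitivity + specificity - 1), and the distance from the ROC point to the perfect corner (0, 1). The function names and the toy scores and labels are hypothetical and are not taken from the study, which worked with presence-only records rather than the known absences used here for simplicity.

```python
import math

def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when sites with score >= threshold are called 'present'."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted_present = s >= threshold
        if predicted_present and y == 1:
            tp += 1
        elif predicted_present and y == 0:
            fp += 1
        elif not predicted_present and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def criteria(tp, fp, tn, fn):
    """Return the four criteria, each oriented so that larger is better."""
    n = tp + fp + tn + fn
    sens = tp / (tp + fn)          # assumes at least one presence
    spec = tn / (tn + fp)          # assumes at least one absence
    oa = (tp + tn) / n
    # Agreement expected by chance, used by Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 0.0
    return {
        "MaxKappa": kappa,
        "MaxOA": oa,
        "MaxTSS": sens + spec - 1,
        # Distance to the ROC corner is minimised, so it is negated here.
        "MinROCdist": -math.hypot(1 - sens, 1 - spec),
    }

def select_thresholds(scores, labels):
    """Scan every observed score as a candidate threshold; keep the best per criterion."""
    best = {}
    for t in sorted(set(scores)):
        for name, value in criteria(*confusion(scores, labels, t)).items():
            if name not in best or value > best[name][1]:
                best[name] = (t, value)   # ties keep the lowest threshold
    return {name: t for name, (t, _) in best.items()}

# Hypothetical toy data: predicted suitability and observed presence (1) / absence (0).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]
print(select_thresholds(toy_scores, toy_labels))
```

On data with balanced prevalence such as this toy set, the four criteria tend to agree; with unbalanced prevalence they can select different thresholds, which is one reason the choice of method matters.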
We argue for choosing RT rather than CT as the SDM if model discrimination capacity (the ability to differentiate between occurrences of presence and absence) is viewed as more important than model reliability (the agreement between predicted relative indexes of occurrence and observed proportions of occurrence), and vice versa. In line with gradient theory, we recommend numerical predictions for species distribution modeling because they convey more information than binary predictions. Binary conversion of model outputs should be carried out only when it is clearly justified by the application's objective. The four aforementioned threshold methods are promising objective methods for the binary conversion of continuous predictions when only presence-only data are available. This study proposes guidelines on how machine learning can be used for specific applied and theoretical purposes in an SDM context.
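The distinction between discrimination and reliability drawn above can be made concrete with a small sketch. It assumes the area under the ROC curve (AUC) as the discrimination measure and a simple binned calibration error as the reliability measure; both are generic choices, not necessarily the measures used in the study, and the function names and toy data are hypothetical. A strictly monotone rescaling of the scores leaves the ranking of sites, and hence the AUC, unchanged, while changing how closely the scores match observed proportions of occurrence.

```python
def auc(scores, labels):
    """Discrimination: chance that a random presence outscores a random absence (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def calibration_error(scores, labels, n_bins=5):
    """Reliability: weighted mean gap between mean predicted score and
    observed presence proportion within equal-width score bins on [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    error = 0.0
    for members in bins:
        if members:
            mean_pred = sum(s for s, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            error += len(members) / len(scores) * abs(mean_pred - observed)
    return error

# Hypothetical toy data: predicted suitability and observed presence (1) / absence (0).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]

# A strictly increasing transform keeps the ranking but rescales the values.
rescaled = [s ** 3 for s in toy_scores]

print(auc(toy_scores, toy_labels), auc(rescaled, toy_labels))
print(calibration_error(toy_scores, toy_labels), calibration_error(rescaled, toy_labels))
```

With so few points the calibration figures are only illustrative; the point of the sketch is that two sets of predictions can be identical in discrimination yet differ in reliability, so the two properties have to be evaluated separately.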
