How to make more from exposure data? An integrated machine learning pipeline to predict pathogen exposure.

1. Predicting infectious disease dynamics is a central challenge in disease ecology. Models that can assess which individuals are most at risk of being exposed to a pathogen not only provide valuable insights into disease transmission and dynamics but can also guide management interventions. Constructing such models for wild animal populations, however, is particularly challenging: often only serological data are available, and only for a subset of individuals, and non-linear relationships between variables are common.

2. Here we provide a guide to the latest advances in statistical machine learning for constructing pathogen-risk models that automatically incorporate complex non-linear relationships, with minimal statistical assumptions, from ecological data with missing values. Our approach compares multiple machine learning algorithms in a unified environment to find the model with the best predictive performance and uses game theory to better interpret the results. We apply this framework to two major pathogens that infect African lions: canine distemper virus (CDV) and feline parvovirus.

3. Our modelling approach provided enhanced predictive performance compared to more traditional approaches, as well as new insights into disease risks in a wild population. We were able to efficiently capture and visualise strong non-linear patterns, and to model complex interactions between variables that shape the risk of exposure to CDV and feline parvovirus. For example, we found that lions were more likely to be exposed to CDV at a young age, but only in low rainfall years.

4. When combined with our data calibration approach, our framework helped us to answer questions about the risk of pathogen exposure that are difficult to address with previous methods. Our framework not only has the potential to aid in predicting disease risk in animal populations, but can also be used to build robust predictive models suitable for other ecological applications, such as modelling species distributions or diversity patterns.
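The abstract says only that game theory is used to interpret the fitted models. The sketch below assumes this refers to Shapley values and illustrates the idea on a toy case. The `risk` function, its thresholds, and the `background` rows are hypothetical and chosen only to mimic the age-by-rainfall interaction described in point 3. They are not the authors' model or data.

```python
from itertools import combinations
from math import factorial

def risk(age, rainfall):
    """Hypothetical exposure-risk model: young animals are at higher
    risk, but only in low rainfall years (all numbers are illustrative)."""
    base = 0.2
    if age < 2 and rainfall < 500:
        return base + 0.5
    return base

# Hypothetical background sample of (age in years, annual rainfall in mm).
background = [(1, 300), (1, 800), (6, 300), (6, 800)]

def value(subset, x):
    """Mean prediction when the features in `subset` are fixed to the
    values in x and the remaining features are taken from the background."""
    total = 0.0
    for row in background:
        mixed = [x[i] if i in subset else row[i] for i in range(len(x))]
        total += risk(*mixed)
    return total / len(background)

def shapley(x):
    """Exact Shapley value of each feature for the prediction at x."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):
            for s in combinations(others, k):
                # Weight of a coalition of size k in the Shapley formula.
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                contrib += w * (value(set(s) | {i}, x) - value(set(s), x))
        phi.append(contrib)
    return phi

# Contributions of age and rainfall for a young lion in a dry year.
phi_age, phi_rain = shapley((1, 300))
print(phi_age, phi_rain)
```

The contributions sum to the difference between the prediction for this individual and the average prediction over the background sample. Because the toy risk comes entirely from the interaction, the two features share the credit equally. Exact enumeration is only feasible for a handful of features, so a real analysis would rely on a sampling-based approximation.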

[45]  Richard J. Orton,et al.  Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes , 2018, Science.