A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices

Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms that depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not comprehensively answer, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole-genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently), and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models, speculative reasoning should feature only as discussion, not as results.
Author summary

Genetic epidemiology studies of malaria attempt to characterise what is happening in malaria parasite populations. In particular, they are an important tool to track the spread of drug resistance and to validate genetic markers of drug resistance. To make sense of parasite genetic data, researchers usually characterise the population structure using statistical methods. This is most often done as a two-step process. The first step is data reduction, whereby the data are summarised into a distance matrix in which each entry represents the genetic distance between two isolates. In the second step, the distance matrix is input into an unsupervised machine learning algorithm. Principal coordinates analysis and hierarchical agglomerative clustering are the two most popular unsupervised machine learning algorithms used for this purpose in malaria genetic epidemiology. We illustrate that this procedure is sensitive to the choice of genetic distance and to the specification of the algorithms. These unsupervised methods are useful for exploratory data analysis but cannot be used to infer historical events. We provide some guidance on how to make genetic epidemiology analyses more transparent and reproducible.
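
The sensitivity to algorithmic specification described above can be illustrated with a minimal sketch. It is not the analysis pipeline used in the study: the six isolates, their positions and the resulting distance matrix are entirely hypothetical, and the helper names (`linkage_distance`, `hac`) are made up for illustration. The sketch runs a naive hierarchical agglomerative clustering on one fixed distance matrix and changes only the linkage rule.

```python
from itertools import combinations

# Hypothetical isolates placed on a line purely to generate a toy distance
# matrix; these values are illustrative and are not from the study.
positions = {"A": 0, "B": 10, "C": 21, "D": 33, "E": 46, "F": 70}
distance = {
    frozenset((i, j)): abs(positions[i] - positions[j])
    for i, j in combinations(positions, 2)
}


def linkage_distance(c1, c2, linkage):
    """Distance between two clusters under single or complete linkage."""
    pairwise = [distance[frozenset((i, j))] for i in c1 for j in c2]
    return min(pairwise) if linkage == "single" else max(pairwise)


def hac(labels, n_clusters, linkage):
    """Naive HAC: merge the closest pair of clusters until n_clusters remain."""
    clusters = [frozenset([label]) for label in labels]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], linkage),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


# Same distance matrix, same number of clusters, different linkage rule.
single = hac(positions, 2, "single")
complete = hac(positions, 2, "complete")
print("single linkage:  ", single)
print("complete linkage:", complete)
```

On this toy matrix, single linkage chains isolates A to E into one cluster and leaves F alone, whereas complete linkage groups E with F. The input distances are identical in both runs, so the difference in cluster membership comes from the algorithmic specification alone and carries no information about ancestry.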
