Public health in genetic spaces: a statistical framework to optimize cluster-based outbreak detection

Abstract: Genetic clustering is a popular method for characterizing variation in transmission rates for rapidly evolving viruses, and could potentially be used to detect outbreaks in ‘near real time’. However, the statistical properties of clustering are poorly understood in this context, and there are no objective guidelines for setting clustering criteria. Here, we develop a new statistical framework to optimize a genetic clustering method based on its ability to forecast new cases. We analysed the pairwise Tamura-Nei (TN93) genetic distances for anonymized HIV-1 subtype B pol sequences from Seattle (n = 1,653) and Middle Tennessee, USA (n = 2,779), and northern Alberta, Canada (n = 809). Under varying TN93 thresholds, we fit two models to the distributions of new cases relative to clusters of known cases: (1) a null model that assumes cluster growth is strictly proportional to cluster size, i.e. no variation in transmission rates among individuals; and (2) a weighted model that incorporates individual-level covariates, such as recency of diagnosis. The optimal threshold maximizes the difference in information loss between the two models, i.e. it is the threshold at which covariates are used most effectively. Optimal TN93 thresholds varied substantially between data sets, e.g. 0.0104 in Alberta and 0.016 in Seattle and Tennessee, such that the optimum for one population would potentially misdirect prevention efforts in another. For a given population, the range of thresholds where the weighted model conferred greater predictive accuracy tended to be narrow (±0.005 units), and the optimal threshold tended to be stable over time. Our framework also indicated that variation in the recency of HIV diagnosis among clusters was significantly more predictive of new cases than sample collection dates (ΔAIC > 50).
These results suggest that one cannot rely on historical precedent or convention to configure genetic clustering methods for public health applications, especially when translating methods between settings of low-level and generalized epidemics. Our framework not only enables investigators to calibrate a clustering method to a specific public health setting, but also provides a variable selection procedure to evaluate different predictive models of cluster growth.
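
The threshold-selection procedure described above can be illustrated with a minimal sketch in Python. It is not the authors' implementation, and it rests on several assumptions that do not come from the text: clusters are taken to be connected components of known cases linked at or below the threshold, each new case is assigned only to the cluster of its closest known case, the weighted model gives each known case a weight `exp(-decay * years_since_diagnosis)`, and the decay rate is chosen by grid search. All function and parameter names (`build_clusters`, `delta_aic`, `optimal_threshold`, `decay_grid`) are hypothetical.

```python
import math

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def build_clusters(ages, known_edges, new_edges, threshold):
    """ages[i]: years since diagnosis of known case i.
    known_edges: (i, j, dist) between known cases.
    new_edges: (new_id, known_id, dist) from new cases to known cases.
    Returns (clusters as lists of ages, new-case count per cluster)."""
    parent = list(range(len(ages)))
    for i, j, d in known_edges:
        if d <= threshold:
            parent[find(parent, i)] = find(parent, j)
    # Assumption: a new case joins only the cluster of its closest known case.
    closest = {}
    for new_id, known_id, d in new_edges:
        if d <= threshold and (new_id not in closest or d < closest[new_id][0]):
            closest[new_id] = (d, known_id)
    roots = sorted({find(parent, i) for i in range(len(ages))})
    index = {r: k for k, r in enumerate(roots)}
    clusters = [[] for _ in roots]
    for i, age in enumerate(ages):
        clusters[index[find(parent, i)]].append(age)
    growth = [0] * len(roots)
    for _, known_id in closest.values():
        growth[index[find(parent, known_id)]] += 1
    return clusters, growth

def poisson_loglik(counts, exposures):
    """Poisson log-likelihood with means proportional to exposure.
    The proportionality constant is set to its MLE, sum(counts)/sum(exposures)."""
    scale = sum(counts) / sum(exposures)
    total = 0.0
    for y, e in zip(counts, exposures):
        m = scale * e
        total += (y * math.log(m) if y else 0.0) - m - math.lgamma(y + 1)
    return total

def delta_aic(clusters, growth, decay_grid):
    """AIC(null) - AIC(weighted); positive values favour the weighted model."""
    sizes = [len(c) for c in clusters]
    null_aic = 2 * 1 - 2 * poisson_loglik(growth, sizes)
    best_weighted = math.inf
    for decay in decay_grid:
        weights = [sum(math.exp(-decay * t) for t in c) for c in clusters]
        # Two parameters: the scale and the decay rate.
        best_weighted = min(best_weighted, 2 * 2 - 2 * poisson_loglik(growth, weights))
    return null_aic - best_weighted

def optimal_threshold(ages, known_edges, new_edges, thresholds, decay_grid):
    """Return the threshold with the largest AIC difference, plus all scores."""
    scores = {}
    for t in thresholds:
        clusters, growth = build_clusters(ages, known_edges, new_edges, t)
        if sum(growth) > 0:  # skip thresholds at which no new case attaches
            scores[t] = delta_aic(clusters, growth, decay_grid)
    best = max(scores, key=scores.get) if scores else None
    return best, scores
```

In practice one would pass pairwise TN93 distances as the edge lists and scan a fine grid of thresholds; a generalized linear model library could replace the closed-form Poisson fit if additional covariates are needed.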
