Convex hulls in Hamming space enable efficient search for similarity and clustering of genomic sequences

Background In molecular epidemiology, comparison of intra-host viral variants among infected persons is frequently used for tracing transmissions in human populations and detecting viral infection outbreaks. Application of Ultra-Deep Sequencing (UDS) immensely increases the sensitivity of transmission detection but brings considerable computational challenges when comparing all pairs of sequences. We developed a new population comparison method based on convex hulls in Hamming space. We applied this method to a large set of UDS samples obtained from unrelated cases infected with hepatitis C virus (HCV) and compared its performance with three previously published methods.
Results The convex hull in Hamming space is a data structure that provides information on: (1) the average Hamming distance within a set; (2) the average Hamming distance between two sets; (3) the closeness centrality of each sequence; and (4) the lower and upper bounds of all pairwise distances among the members of two sets. Used as a filtering strategy, it rapidly and correctly removes 96.2% of all pairwise HCV sample comparisons, outperforming all previous methods. The convex hull distance (CHD) algorithm showed variable performance depending on the sequence heterogeneity of the studied populations in real and simulated datasets, suggesting that clustering methods could improve its performance. To address this issue, we developed a new clustering algorithm, k-hulls, that reduces the heterogeneity of the convex hull. This efficient algorithm is an extension of the k-means algorithm and can be used with any type of categorical data. It is 6.8 times more accurate than k-modes, a previously developed clustering algorithm for categorical data.
Conclusions CHD is a fast and efficient filtering strategy that massively reduces the computational burden of pairwise comparison among large samples of sequences, thereby aiding the calculation of transmission links among infected individuals using threshold-based methods. In addition, the convex hull efficiently provides important summary metrics for intra-host viral populations.
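
The summary metrics listed in the Results can be illustrated with a minimal sketch. It is not the authors' CHD implementation: it assumes aligned, equal-length sequences and represents each sample only by its per-position symbol counts, and every function name is hypothetical. The bounds follow from a simple observation: two sequences must differ at any position where the two samples share no symbol, and can differ only at positions where more than one symbol occurs across both samples.

```python
from collections import Counter


def column_counts(seqs):
    """Per-position symbol counts for aligned, equal-length sequences."""
    return [Counter(col) for col in zip(*seqs)]


def mean_distance_within(seqs):
    """Mean Hamming distance over all unordered pairs in one sample.

    Computed from column counts, so no pair of sequences is compared directly.
    """
    n = len(seqs)
    pairs = n * (n - 1) // 2
    total = 0
    for counts in column_counts(seqs):
        same = sum(c * (c - 1) // 2 for c in counts.values())
        total += pairs - same  # pairs that differ at this position
    return total / pairs


def mean_distance_between(seqs_a, seqs_b):
    """Mean Hamming distance over all cross-sample pairs."""
    pairs = len(seqs_a) * len(seqs_b)
    total = 0
    for ca, cb in zip(column_counts(seqs_a), column_counts(seqs_b)):
        same = sum(ca[s] * cb[s] for s in ca)
        total += pairs - same
    return total / pairs


def distance_bounds(seqs_a, seqs_b):
    """Lower and upper bounds on the distance of any cross-sample pair."""
    lower = upper = 0
    for ca, cb in zip(column_counts(seqs_a), column_counts(seqs_b)):
        symbols_a, symbols_b = set(ca), set(cb)
        if not (symbols_a & symbols_b):
            lower += 1  # no shared symbol: every pair differs here
        if len(symbols_a | symbols_b) > 1:
            upper += 1  # more than one symbol: some pair may differ here
    return lower, upper


sample_a = ["ACGT", "ACGA", "ACTT"]  # toy data, not from the study
sample_b = ["TCGT", "TCGG"]
print(mean_distance_within(sample_a))
print(mean_distance_between(sample_a, sample_b))
print(distance_bounds(sample_a, sample_b))
```

Under these assumptions, a threshold-based filter can skip a pair of samples whenever the lower bound exceeds the distance threshold, because no sequence pair across the two samples can then fall within it; only the remaining sample pairs need exact pairwise comparison.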
