DISSEQT—DIStribution-based modeling of SEQuence space Time dynamics

Rapidly evolving microbes are a challenge to model because of the volatile, complex and dynamic nature of their populations. We developed the DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics) for analyzing, visualizing and predicting the evolution of heterogeneous biological populations in multidimensional genetic space. It is suited for population-based modeling of deep sequencing and high-throughput data. DISSEQT is openly available on GitHub (https://github.com/rasmushenningsson/DISSEQT.jl) and Synapse (https://www.synapse.org/#!Synapse:syn11425758), and covers the entire workflow from read alignment to visualization of results. DISSEQT is centered around robust dimension and model reduction algorithms for analysis of genotypic data, with additional capabilities for including phenotypic features to explore dynamic genotype-phenotype maps. We illustrate its utility and capacity with examples from evolving RNA virus populations, which present one of the highest degrees of population heterogeneity found in nature. Using DISSEQT, we empirically reconstruct the evolutionary trajectories of evolving populations in sequence space and genotype-phenotype fitness landscapes. We show that while sequence space is vastly multidimensional, the relevant genetic space of evolving microbial populations is of intrinsically low dimension. In addition, evolutionary trajectories of these populations can be faithfully monitored to identify the key minority genotypes contributing most to evolution. Finally, we show that empirical fitness landscapes, when reconstructed to include minority variants, can predict phenotype from genotype with high accuracy.
