SuperDCA for genome-wide epistasis analysis

The potential for genome-wide modelling of epistasis has recently surfaced given the possibility of sequencing densely sampled populations and the emerging families of statistical interaction models. Direct coupling analysis (DCA) has previously been shown to yield valuable predictions for single protein structures, and has recently been extended to genome-wide analysis of bacteria, identifying novel interactions in the co-evolution between resistance, virulence and core genome elements. However, earlier computational DCA methods have not been scalable to enable model fitting simultaneously to 10⁴–10⁵ polymorphisms, representing the amount of core genomic variation observed in analyses of many bacterial species. Here, we introduce a novel inference method (SuperDCA) that employs a new scoring principle, efficient parallelization, optimization and filtering on phylogenetic information to achieve scalability for up to 10⁵ polymorphisms. Using two large population samples of Streptococcus pneumoniae, we demonstrate the ability of SuperDCA to make additional significant biological findings about this major human pathogen. We also show that our method can uncover signals of selection that are not detectable by genome-wide association analysis, even though our analysis does not require phenotypic measurements. SuperDCA, thus, holds considerable potential in building understanding about numerous organisms at a systems biological level.
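
To give a rough sense of why scalability is the bottleneck, the sketch below counts locus pairs and pairwise coupling parameters for a Potts-style model at the scales mentioned above. It is an illustration only, not the SuperDCA implementation: the function name `potts_parameter_count` is hypothetical, and the choice of 5 states per locus (four nucleotides plus a gap) is an assumption, as is the use of a full q × q coupling matrix per pair.

```python
from math import comb


def potts_parameter_count(n_loci: int, n_states: int) -> tuple[int, int]:
    """Return (number of locus pairs, number of pairwise coupling parameters).

    Assumes a full n_states x n_states coupling matrix for every pair of loci,
    with no gauge reduction or filtering (an illustrative simplification).
    """
    n_pairs = comb(n_loci, 2)
    return n_pairs, n_pairs * n_states * n_states


# n_states=5 is an assumption: A, C, G, T and a gap/unknown state.
for n_loci in (10**4, 10**5):
    pairs, params = potts_parameter_count(n_loci, n_states=5)
    print(f"{n_loci:>7} loci: {pairs:.2e} pairs, {params:.2e} coupling parameters")
```

Under these assumptions, moving from 10⁴ to 10⁵ loci increases the number of pairs about a hundredfold, to roughly 5 × 10⁹, which is why parallelization and filtering matter at this scale.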
