Data-driven supervised learning of a viral protease specificity landscape from deep sequencing and molecular simulations

Significance

The substrate specificity landscape of a protease is the set of all substrate sequences that the enzyme recognizes and cleaves and, just as importantly, those that it does not. Accurate and rapid elucidation of these landscapes for any given protease is key to designing novel targeted proteases that avoid unwanted off-target cleavage, and it provides insight into the functional robustness of naturally occurring proteases. We developed a structure-guided approach for predicting protease substrate specificity landscapes, in which data from experiments in yeast and molecular simulations are combined using machine learning. Using this approach, we comprehensively map the sequence–energetics–function landscape of the hepatitis C virus NS3/4A protease and its drug-resistant variants.

Biophysical interactions between proteins and peptides are key determinants of molecular recognition specificity landscapes. However, an understanding of how molecular structure and residue-level energetics at protein–peptide interfaces shape these landscapes remains elusive. We combine information from yeast-based library screening, next-generation sequencing, and structure-based modeling in a supervised machine learning approach to report a comprehensive sequence–energetics–function mapping of the specificity landscape of the hepatitis C virus (HCV) NS3/4A protease, whose function (site-specific cleavage of the viral polyprotein) is a key determinant of viral fitness. We screened a library of substrates in which five residue positions were randomized, and we measured the cleavability of ∼30,000 substrates (∼1% of the library) using yeast display and fluorescence-activated cell sorting followed by deep sequencing. Structure-based models of a subset of the experimentally derived sequences were then used in a supervised learning procedure to train a support vector machine that predicts the cleavability of 3.2 million substrate variants by the HCV protease.
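As a rough illustration of the supervised learning step described above, the following minimal sketch trains a linear soft-margin support vector machine by stochastic subgradient descent on the hinge loss. Everything in it is an assumption made for illustration: the five features stand in for hypothetical per-position energy terms, the data are synthetic and trivially separable, the features are taken to be centered so no bias term is used, and the names `make_toy_data`, `train_linear_svm`, and `predict` are invented here. The kernel, features, and training protocol used in the study are not given in this text.

```python
import random

# Five randomized positions with 20 amino acids each give the full library size.
LIBRARY_SIZE = 20 ** 5  # 3,200,000 variants, i.e. the "3.2 million" in the text


def make_toy_data(n_per_class=50, seed=0):
    """Synthetic stand-in for energy features (NOT data from the study).

    Cleavable examples (+1) get favorable (negative) toy energies,
    uncleavable examples (-1) get unfavorable (positive) ones.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_per_class):
        X.append([rng.uniform(-2.5, -1.5) for _ in range(5)])
        y.append(+1)
        X.append([rng.uniform(1.5, 2.5) for _ in range(5)])
        y.append(-1)
    return X, y


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=20, seed=0):
    """Linear SVM without bias: hinge loss plus L2 penalty, constant step size."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Shrink the weights (gradient of the L2 regularizer).
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss subgradient step only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w, x):
    """Return +1 (cleavable) or -1 (uncleavable) for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


X, y = make_toy_data()
w = train_linear_svm(X, y)
train_accuracy = sum(predict(w, x) == label for x, label in zip(X, y)) / len(X)
```

In a setting like the one described, a model of this kind would be trained on the experimentally labeled subset and then applied to feature vectors computed for the remaining variants. A kernel SVM from an established library would be the more usual choice in practice.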
The resulting landscape allows identification of previously unidentified HCV protease substrates, and graph-theoretic analyses reveal extensive clustering of cleavable and uncleavable motifs in sequence space. Specificity landscapes of known drug-resistant variants are similarly clustered. The described approach should enable the elucidation and redesign of specificity landscapes of a wide variety of proteases, including human-origin enzymes. Our results also suggest a possible role for residue-level energetics in shaping plateau-like functional landscapes predicted from viral quasispecies theory.
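The clustering of motifs in sequence space can be pictured with a small graph sketch. The version below assumes that two substrates are connected when they differ at exactly one of the five randomized positions (Hamming distance 1) and that clusters are the connected components of that graph. Both the edge definition and the toy sequences are assumptions made for illustration, and `hamming` and `clusters` are hypothetical helper names; the graph construction used in the study is not specified in this text.

```python
from collections import deque

# Under the Hamming-distance-1 assumption, each of the 20**5 sequences
# has 5 positions x 19 alternative residues = 95 neighbors.
NEIGHBORS_PER_NODE = 5 * 19


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def clusters(seqs):
    """Connected components of the graph linking sequences one mutation apart."""
    seqs = list(seqs)
    unseen = set(seqs)
    result = []
    for start in seqs:
        if start not in unseen:
            continue
        unseen.remove(start)
        component, queue = [start], deque([start])
        while queue:  # breadth-first search from the start sequence
            current = queue.popleft()
            neighbors = [s for s in unseen if hamming(current, s) == 1]
            for s in neighbors:
                unseen.remove(s)
                component.append(s)
                queue.append(s)
        result.append(sorted(component))
    # Largest clusters first.
    return sorted(result, key=len, reverse=True)


# Invented five-residue sequences, NOT substrates reported in the study.
toy_cleavable = ["DEMEE", "DEMEA", "DEMAA", "AAAAA", "AAAAC", "WWWWW"]
cluster_sizes = [len(c) for c in clusters(toy_cleavable)]
```

The pairwise neighbor search here is quadratic in the number of sequences, which is fine for a toy set. For millions of variants one would instead enumerate the 95 single-residue neighbors of each sequence and look them up in a hash set.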
