Decoding sequence-level information to predict membrane protein expression

The expression and purification of integral membrane proteins remain a major bottleneck in the characterization of these important proteins. Expression levels are currently unpredictable, which makes the pursuit of these targets challenging and highly inefficient. Evidence demonstrates that small changes in the nucleotide or amino-acid sequence can dramatically affect membrane protein biogenesis, yet these observations have not resulted in generalizable approaches to improve expression. In this study, we develop a data-driven statistical model that predicts membrane protein expression in E. coli directly from sequence. The model, trained on experimental data, combines a set of sequence-derived variables into a score that predicts the likelihood of expression. We test the model against independent datasets from the literature that span a variety of scales and experimental outcomes, demonstrating that the model significantly enriches for expressed proteins. The model is then used to score expression for membrane proteomes and protein families, highlighting areas where the model excels. Surprisingly, analysis of the underlying features reveals the importance of nucleotide sequence-derived parameters for expression. This computational model, as illustrated here, can immediately be used to identify favorable targets for characterization.
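As a rough illustration of the idea described above (combining sequence-derived variables into a single score, then checking whether that score enriches for expressed proteins), the following is a minimal sketch and not the model developed in this study. The three features, the weights in `WEIGHTS`, and the function names are all hypothetical and chosen only for illustration; the study's actual variables and fitted parameters are not given in this abstract.

```python
from typing import Dict, Sequence

# Hypothetical residue set used only for this illustration.
HYDROPHOBIC = set("AILMFVW")


def sequence_features(nt_seq: str, aa_seq: str) -> Dict[str, float]:
    """Compute a few toy features from the nucleotide and amino-acid sequences."""
    nt = nt_seq.upper()
    aa = aa_seq.upper()
    if not nt or not aa:
        raise ValueError("sequences must be non-empty")
    return {
        "gc_content": sum(base in "GC" for base in nt) / len(nt),
        "hydrophobic_fraction": sum(res in HYDROPHOBIC for res in aa) / len(aa),
        "positive_fraction": sum(res in "KR" for res in aa) / len(aa),
    }


# Hypothetical weights; a real model would learn these from experimental data.
WEIGHTS = {"gc_content": -1.0, "hydrophobic_fraction": 0.5, "positive_fraction": 2.0}


def expression_score(features: Dict[str, float],
                     weights: Dict[str, float] = WEIGHTS) -> float:
    """Linear combination of features: higher means 'more likely to express'."""
    return sum(weights[name] * value for name, value in features.items())


def auc(scores: Sequence[float], expressed: Sequence[bool]) -> float:
    """Area under the ROC curve: the probability that a randomly chosen
    expressed protein scores higher than a non-expressed one (ties count 0.5)."""
    pos = [s for s, e in zip(scores, expressed) if e]
    neg = [s for s, e in zip(scores, expressed) if not e]
    if not pos or not neg:
        raise ValueError("need at least one expressed and one non-expressed protein")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC near 0.5 would indicate that the score does not separate expressed from non-expressed proteins, while values approaching 1.0 would indicate strong enrichment of expressed proteins among high-scoring targets.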


[59]  John D. Westbrook,et al.  The Structural Biology Knowledgebase: a portal to protein structures, sequences, functions, and methods , 2011, Journal of Structural and Functional Genomics.

[60]  Johannes E. Schindelin,et al.  Fiji: an open-source platform for biological-image analysis , 2012, Nature Methods.

[61]  A. Kolstø,et al.  A genomic strategy for cloning, expressing and purifying efflux proteins of the major facilitator superfamily. , 2007, The Journal of antimicrobial chemotherapy.

[62]  E. Birney,et al.  Pfam: the protein families database , 2013, Nucleic Acids Res..

[63]  W. Hendrickson Atomic-level analysis of membrane-protein structure , 2016, Nature Structural &Molecular Biology.

[64]  B. Auguié Miscellaneous Functions for "Grid" Graphics , 2015 .

[65]  Tamir Gonen,et al.  MicroED opens a new era for biological structure determination. , 2016, Current opinion in structural biology.

[66]  Annie Frelet-Barrand,et al.  Heterologous Expression of Membrane Proteins: Choosing the Appropriate Host , 2011, PloS one.

[67]  Marcin J Mizianty,et al.  CRYSpred: accurate sequence-based protein crystallization propensity prediction using sequence-derived structural characteristics. , 2012, Protein and peptide letters.

[68]  Pedro M. Valero-Mora,et al.  ggplot2: Elegant Graphics for Data Analysis , 2010 .

[69]  Gunnar von Heijne,et al.  Control of Membrane Protein Topology by a Single C-Terminal Residue , 2010, Science.

[70]  T. Gibson,et al.  Protein disorder prediction: implications for structural proteomics. , 2003, Structure.

[71]  J. Frydman,et al.  Cotranslational signal independent SRP preloading during membrane targeting , 2016, Nature.

[72]  Mark Gerstein,et al.  SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics , 2001, Nucleic Acids Res..

[73]  Mindy I. Davis,et al.  Breaking Cryo-EM Resolution Barriers to Facilitate Drug Discovery , 2016, Cell.

[74]  John D. Westbrook,et al.  TargetDB: a target registration database for structural genomics projects , 2004, Bioinform..

[75]  Hirokazu Chiba,et al.  MBGD update 2015: microbial genome database for flexible ortholog analysis utilizing a diverse set of genomic data , 2014, Nucleic Acids Res..

[76]  Arne Elofsson,et al.  Manipulating the genetic code for membrane protein production: what have we learnt so far? , 2012, Biochimica et biophysica acta.

[77]  Samad Jahandideh,et al.  RFCRYS: sequence-based protein crystallization propensity prediction by means of random forest. , 2012, Journal of theoretical biology.

[78]  Yuanzi Hua,et al.  Cloning and expression of multiple integral membrane proteins from Mycobacterium tuberculosis in Escherichia coli , 2005, Protein science : a publication of the Protein Society.

[79]  Kimberly Van Auken,et al.  WormBase 2014: new views of curated biology , 2013, Nucleic Acids Res..

[80]  Hadley Wickham,et al.  The Split-Apply-Combine Strategy for Data Analysis , 2011 .

[81]  Andreas Plückthun,et al.  Directed evolution of a G protein-coupled receptor for expression, stability, and binding selectivity , 2008, Proceedings of the National Academy of Sciences.

[82]  Lukasz Kurgan,et al.  Meta prediction of protein crystallization propensity. , 2009, Biochemical and biophysical research communications.

[83]  Edward Rolf Tufte,et al.  The visual display of quantitative information , 1985 .

[84]  H. Michel,et al.  Comparative analysis and “expression space” coverage of the production of prokaryotic membrane proteins for structural genomics , 2006, Protein science : a publication of the Protein Society.

[85]  Thomas F. Miller,et al.  A Link between Integral Membrane Protein Expression and Simulated Integration Efficiency. , 2016, Cell reports.

[86]  D. Freedman,et al.  On the histogram as a density estimator:L2 theory , 1981 .

[87]  J. R. Coleman,et al.  Virus Attenuation by Genome-Scale Changes in Codon Pair Bias , 2008, Science.

[88]  Lorenz Wernisch,et al.  Unexpected correlations between gene expression and codon usage bias from microarray data for the whole Escherichia coli K-12 genome. , 2003, Nucleic Acids Research.

[89]  A. Desmyter,et al.  Structural genomics on membrane proteins: comparison of more than 100 GPCRs in 3 expression systems , 2007, Journal of Structural and Functional Genomics.

[90]  R. Neutze,et al.  Effective high-throughput overproduction of membrane proteins in Escherichia coli. , 2008, Protein expression and purification.

[91]  Peter F. Stadler,et al.  ViennaRNA Package 2.0 , 2011, Algorithms for Molecular Biology.

[92]  S. Salzberg,et al.  Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima , 1999, Nature.

[93]  A. Kletzin,et al.  An Efficient Strategy for Small-Scale Screening and Production of Archaeal Membrane Transport Proteins in Escherichia coli , 2013, PloS one.

[94]  Andreas Plückthun,et al.  Critical features for biosynthesis, stability, and functionality of a G protein-coupled receptor uncovered by all-versus-all mutations , 2012, Proceedings of the National Academy of Sciences.

[95]  Morten H. H. Nørholm,et al.  Enhanced Protein Production in Escherichia coli by Optimization of Cloning Scars at the Vector-Coding Sequence Junction. , 2015, ACS synthetic biology.

[96]  James E. Bray,et al.  High-throughput production of prokaryotic membrane proteins , 2005, Journal of Structural and Functional Genomics.

[97]  Edith D. Wong,et al.  The Reference Genome Sequence of Saccharomyces cerevisiae: Then and Now , 2013, G3: Genes, Genomes, Genetics.

[98]  B. Barrell,et al.  Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence , 1998, Nature.

[99]  Mark A. Girolami,et al.  BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btn055 Sequence analysis ParCrys: a Parzen window density estimation approach , 2022 .

[100]  O. Bocharova,et al.  Expression of G-protein coupled receptors in Escherichia coli for structural studies , 2010, Biochemistry (Moscow).

[101]  G. von Heijne,et al.  Materials and Methods Figs. S1 to S3 References and Notes Global Topology Analysis of the Escherichia Coli Inner Membrane Proteome , 2022 .

[102]  P. Nordlund,et al.  An efficient strategy for high‐throughput expression screening of recombinant integral membrane proteins , 2005, Protein science : a publication of the Protein Society.

[103]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[104]  Jianzhao Gao,et al.  Improved Prediction of Protein Crystallization, Purification and Production Propensity Using Hybrid Sequence Representation , 2013 .

[105]  Samuel Wagner,et al.  Tuning Escherichia coli for membrane protein overexpression , 2008, Proceedings of the National Academy of Sciences.

[106]  Christopher G. Tate,et al.  Overcoming barriers to membrane protein structure determination , 2011, Nature Biotechnology.

[107]  G. Schuler,et al.  Entrez: molecular biology database and retrieval system. , 1996, Methods in enzymology.

[108]  J. Silberg,et al.  A transposase strategy for creating libraries of circularly permuted proteins , 2012, Nucleic acids research.

[109]  Peden Jf,et al.  Analysis of codon usage. , 2000 .

[110]  Ole Tange,et al.  GNU Parallel: The Command-Line Power Tool , 2011, login Usenix Mag..

[111]  M. Kendall A NEW MEASURE OF RANK CORRELATION , 1938 .

[112]  Hadley Wickham,et al.  ggplot2 - Elegant Graphics for Data Analysis (2nd Edition) , 2017 .

[113]  C. Jeffery,et al.  Recombinant Expression Screening of P. aeruginosa Bacterial Inner Membrane Proteins , 2009, BMC biotechnology.

[114]  Jindan Zhou,et al.  EcoGene 3.0 , 2012, Nucleic Acids Res..

[115]  Jun Hu,et al.  TargetCrys: protein crystallization prediction by fusing multi-view features with two-layered SVM , 2016, Amino Acids.

[116]  T. Creamer,et al.  Solvation energies of amino acid side chains and backbone in a family of host-guest pentapeptides. , 1996, Biochemistry.

[117]  Thomas F. Miller,et al.  Regulation of multispanning membrane protein topology via post-translational annealing , 2015, eLife.

[118]  Kevin W Eliceiri,et al.  NIH Image to ImageJ: 25 years of image analysis , 2012, Nature Methods.

[119]  Ganesan Pugalenthi,et al.  SVMCRYS: an SVM approach for the prediction of protein crystallization propensity from protein sequence. , 2010, Protein and peptide letters.