Filling-in void and sparse regions in protein sequence space by protein-like artificial sequences enables remarkable enhancement in remote homology detection capability.

Protein functional annotation relies on accurately identifying relationships between proteins, and sequence divergence is a key obstacle to detecting them. This is especially evident when distant relationships can be demonstrated only through three-dimensional structures. To address this challenge, we describe a computational approach that purposefully bridges gaps between related protein families through the directed design of protein-like "linker" sequences. We represented SCOP domain families, integrated with their sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases: one containing only natural sequences and another additionally containing the designed sequences. Searches in the augmented database showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the same fold established a sequence continuum that demonstrated 373 difficult relationships. Finally, as a practical and realistic extension, we demonstrate that such protein-like sequences can be "plugged into" routine, generic sequence database searches to empower not only remote homology detection but also fold recognition. Our findings, which are statistically well supported, show that complementary searches in both databases increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold.
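The abstract names a roulette wheel-based design step but does not give its details. The following is only a minimal sketch of the general technique, under stated assumptions: each aligned profile position is reduced to a 20-value residue distribution, the two families' distributions are simply averaged, and one residue is drawn per position with probability proportional to its weight. The function names (`roulette_pick`, `design_linker`) and the averaging rule are illustrative and are not taken from the paper.

```python
import random
from bisect import bisect_right
from itertools import accumulate

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def roulette_pick(weights, rng):
    """Return an index chosen with probability proportional to its weight."""
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    spin = rng.random() * total  # where the wheel stops, in [0, total)
    # Guard against a floating-point spin that rounds up to `total`.
    return min(bisect_right(cumulative, spin), len(weights) - 1)


def design_linker(columns_a, columns_b, rng=None):
    """Sample one artificial sequence from two aligned profiles.

    columns_a, columns_b: equal-length lists of 20-value residue
    distributions, one per aligned position (hypothetical input format).
    Assumption: the two distributions are averaged at each position.
    """
    if len(columns_a) != len(columns_b):
        raise ValueError("profiles must cover the same aligned positions")
    rng = rng or random.Random()
    residues = []
    for col_a, col_b in zip(columns_a, columns_b):
        mixed = [(a + b) / 2.0 for a, b in zip(col_a, col_b)]
        residues.append(AMINO_ACIDS[roulette_pick(mixed, rng)])
    return "".join(residues)


def one_hot(residue):
    """Distribution that puts all weight on a single residue."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Position 1 is fixed to M in both families; position 2 is K or R.
    family_a = [one_hot("M"), one_hot("K")]
    family_b = [one_hot("M"), one_hot("R")]
    print(design_linker(family_a, family_b, random.Random(0)))
```

Calling `design_linker` repeatedly with different random states yields many distinct sequences that mix the residue preferences of both families, which is the sense in which such sequences could act as intermediates in a homology search.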
