SECOM: A Novel Hash Seed and Community Detection Based-Approach for Genome-Scale Protein Domain Identification

Rapid advances in DNA sequencing technologies have generated a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. However, traditional database-based domain prediction methods cannot identify novel domains, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight reflects the hash seeds shared by the two nodes it connects. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph, and we propose a backward graph percolation algorithm that efficiently identifies the domains. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. Compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains while also running three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatases (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and is available at http://sfb.kaust.edu.sa/Pages/Software.aspx.
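The indexing and graph-construction idea described above can be illustrated with a minimal Python sketch. It is not SECOM's implementation and rests on several assumptions: contiguous length-`k` substrings stand in for the hash seeds, the edge weight is taken to be the count of distinct shared seeds, and thresholded connected components stand in for the backward graph percolation step, so the groups found here do not overlap. The function names (`seed_index`, `shared_seed_graph`, `components`) and the toy sequences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k=4):
    """Map each length-k substring (a stand-in for a hash seed) to the proteins containing it."""
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def shared_seed_graph(index):
    """Build edges between proteins; weight = number of distinct seeds they share (assumed)."""
    weights = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


def components(weights, min_weight):
    """Connected components over edges with weight >= min_weight.

    A simplified stand-in for overlapping community detection:
    each protein ends up in at most one group.
    """
    adj = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_weight:
            adj[a].add(b)
            adj[b].add(a)
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Toy proteome: P1 and P2 share the segment "AYIAKQRQ"; P3 is unrelated.
proteins = {
    "P1": "MKTAYIAKQRQISFVK",
    "P2": "GGSAYIAKQRQLLNDE",
    "P3": "WWWWPPPPHHHHCCCC",
}
index = seed_index(proteins, k=4)
graph = shared_seed_graph(index)
groups = components(graph, min_weight=3)
print(graph)
print(groups)
```

The point of the sketch is that building the index takes a single pass over the sequences, and only proteins that share at least one seed are ever compared, which is how an index-based approach avoids all-against-all alignment.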
