Density parameter estimation for finding clusters of homologous proteins - tracing actinobacterial pathogenicity lifestyles

MOTIVATION: Homology detection is a long-standing challenge in computational biology. A typical approach couples all-versus-all BLAST results with a data partitioning method to obtain clusters of putatively homologous proteins. One of the main problems, however, has been widely neglected: all clustering tools need a density parameter that adjusts the number and size of the clusters. This parameter is crucial but hard to estimate without gold standard data at hand, and developing a gold standard is a difficult and time-consuming task. A reliable method for detecting clusters of homologous proteins across a large set of species would open opportunities for better understanding the genetic repertoire of bacteria with different lifestyles.

RESULTS: Our main contribution is a method for identifying a suitable and robust density parameter for protein homology detection without a given gold standard. To this end, we study the core genome of 89 actinobacteria. This allows us to incorporate background knowledge, i.e. the assumption that a set of evolutionarily closely related species should share a comparably high number of evolutionarily conserved proteins (emerging from phylum-specific housekeeping genes). We apply our strategy to find genes/proteins that are specific to certain actinobacterial lifestyles, i.e. different types of pathogenicity. The whole study was performed with transitivity clustering, as it requires only a single intuitive density parameter and has been shown to be well suited to protein sequence clustering. Note, however, that the presented strategy does not depend on this clustering method and can easily be adapted to other clustering approaches.

AVAILABILITY: All results are publicly available at http://transclust.mmci.uni-saarland.de/actino_core/ or as Supplementary Material of this article.

CONTACT: roettger@mpi-inf.mpg.de

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
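The idea of using the core genome as background knowledge can be illustrated with a small Python sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state: connected components over thresholded similarities stand in for transitivity clustering, a "core cluster" is taken to be any cluster containing at least one protein from every species, and the density threshold is chosen by maximizing the number of core clusters. All names and the toy similarity scores are hypothetical.

```python
from collections import defaultdict


def cluster_by_threshold(proteins, similarities, threshold):
    """Group proteins into connected components of the graph whose edges
    are the pairs with similarity >= threshold.
    Assumption: a simple stand-in for transitivity clustering."""
    parent = {p: p for p in proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarities.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for p in proteins:
        clusters[find(p)].add(p)
    return list(clusters.values())


def count_core_clusters(clusters, species_of, n_species):
    """Count clusters that contain a protein from every species
    (assumed definition of a core-genome cluster)."""
    return sum(
        1 for c in clusters
        if len({species_of[p] for p in c}) == n_species
    )


def scan_thresholds(proteins, similarities, species_of, thresholds):
    """Return {threshold: number of core clusters} for each candidate."""
    n_species = len(set(species_of.values()))
    return {
        t: count_core_clusters(
            cluster_by_threshold(proteins, similarities, t),
            species_of,
            n_species,
        )
        for t in thresholds
    }


# Hypothetical toy data: three species (A, B, C), two protein families.
species_of = {"a1": "A", "b1": "B", "c1": "C",
              "a2": "A", "b2": "B", "c2": "C"}
proteins = list(species_of)
similarities = {
    ("a1", "b1"): 90, ("b1", "c1"): 85, ("a1", "c1"): 80,  # family 1
    ("a2", "b2"): 60, ("b2", "c2"): 55, ("a2", "c2"): 50,  # family 2
    ("a1", "a2"): 20,                                      # weak cross-family hit
}

counts = scan_thresholds(proteins, similarities, species_of, [10, 30, 70, 95])
# Assumed selection rule: keep the threshold with the most core clusters.
best_threshold = max(counts, key=counts.get)
```

In this toy setting a threshold that is too low merges both families into one cluster, while one that is too high splits the weaker family into singletons; only an intermediate value recovers both families as core clusters. A real analysis would additionally need to check that the chosen value is robust, for example that the core-cluster count stays stable across neighboring thresholds.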
