High-quality sequence clustering guided by network topology and multiple alignment likelihood

MOTIVATION Proteins can be naturally classified into families of homologous sequences that derive from a common ancestor. The comparison of homologous sequences and the analysis of their phylogenetic relationships provide useful information regarding the function and evolution of genes. One important difficulty of clustering methods is to distinguish highly divergent homologous sequences from sequences that only share partial homology due to evolution by protein domain rearrangements. Existing clustering methods require parameters that have to be set a priori. Given the variability in the evolution pattern among proteins, these parameters cannot be optimal for all gene families. RESULTS We propose a strategy that aims at clustering sequences homologous over their entire length, and that takes into account the pattern of substitution specific to each gene family. Sequences are first all compared with each other and clustered into pre-families, based on pairwise similarity criteria, with permissive parameters to optimize sensitivity. Pre-families are then divided into homogeneous clusters, based on the topology of the similarity network. Finally, clusters are progressively merged into families, for which we compute multiple alignments, and we use a model selection technique to find the optimal tradeoff between the number of families and multiple alignment likelihood. To evaluate this method, called HiFiX, we analyzed simulated sequences and manually curated datasets. These tests showed that HiFiX is the only method robust to both sequence divergence and domain rearrangements. HiFiX is fast enough to be used on very large datasets. AVAILABILITY AND IMPLEMENTATION The Python software HiFiX is freely available at http://lbbe.univ-lyon1.fr/hifix.

[1]  Albert J. Vilella,et al.  EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. , 2009, Genome research.

[2]  Yanchi Liu,et al.  Community detection in graphs through correlation , 2014, KDD.

[3]  M E J Newman,et al.  Community structure in social and biological networks , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[4]  James A. Casbon,et al.  Spectral clustering of protein sequences , 2006, Nucleic acids research.

[5]  Shrikanth S. Narayanan,et al.  Strategies to Improve the Robustness of Agglomerative Hierarchical Clustering Under Data Source Variation for Speaker Diarization , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Tao Liu,et al.  TreeFam: 2008 Update , 2007, Nucleic Acids Res..

[7]  John H. Morris,et al.  Improving the quality of protein similarity network clustering algorithms using the network edge weight distribution , 2011, Bioinform..

[8]  Anton J. Enright,et al.  An efficient algorithm for large-scale detection of protein families. , 2002, Nucleic acids research.

[9]  Franck Picard,et al.  Deciphering the connectivity structure of biological networks using MixNet , 2009, BMC Bioinformatics.

[10]  W. Pearson,et al.  Homologous over-extension: a challenge for iterative similarity searches , 2010, Nucleic acids research.

[11]  Charbel Niño El-Hani,et al.  Detecting Network Communities: An Application to Phylogenetic Analysis , 2011, PLoS Comput. Biol..

[12]  Claudio Donati,et al.  Protein Homology Network Families Reveal Step-Wise Diversification of Type III and Type IV Secretion Systems , 2006, PLoS Comput. Biol..

[13]  Guy Perrière,et al.  Databases of homologous gene families for comparative genomics , 2009, BMC Bioinformatics.

[14]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[15]  Gérard Govaert,et al.  Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Jian-Huang Lai,et al.  Phylogeny Inference Based on Spectral Graph Clustering , 2011, J. Comput. Biol..

[17]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[18]  Vincent Miele,et al.  Ultra-fast sequence clustering from similarity networks with SiLiX , 2011, BMC Bioinformatics.

[19]  Shoshana D. Brown,et al.  A gold standard set of mechanistically diverse enzyme superfamilies , 2006, Genome Biology.

[20]  Dorothea Emig,et al.  Partitioning biological data with transitivity clustering , 2010, Nature Methods.

[21]  Thomas E. Ferrin,et al.  Using Sequence Similarity Networks for Visualization of Relationships Across Diverse Protein Superfamilies , 2009, PloS one.

[22]  Robert D. Finn,et al.  The Pfam protein families database , 2004, Nucleic Acids Res..

[23]  Jérôme Gouzy,et al.  The ProDom database of protein domain families , 1998, Nucleic Acids Res..

[24]  Sean R Eddy,et al.  A new generation of homology search tools based on probabilistic inference. , 2009, Genome informatics. International Conference on Genome Informatics.

[25]  Dannie Durand,et al.  Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins , 2008, PLoS Comput. Biol..

[26]  Michael Y. Galperin,et al.  Diversity of structure and function of response regulator output domains. , 2010, Current opinion in microbiology.

[27]  Kazutaka Katoh,et al.  Multiple alignment of DNA sequences with MAFFT. , 2009, Methods in molecular biology.

[28]  Michael Y. Galperin,et al.  The COG database: new developments in phylogenetic classification of proteins from complete genomes , 2001, Nucleic Acids Res..

[29]  Ziheng Yang,et al.  INDELible: A Flexible Simulator of Biological Sequence Evolution , 2009, Molecular biology and evolution.

[30]  Like Fokkens,et al.  Enrichment of homologs in insignificant BLAST hits by co-complex network alignment , 2010, BMC Bioinformatics.

[31]  W. Ludwig,et al.  SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB , 2007, Nucleic acids research.

[32]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.

[33]  T. Snijders,et al.  Estimation and Prediction for Stochastic Blockstructures , 2001 .

[34]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[35]  P. Shannon,et al.  Cytoscape: a software environment for integrated models of biomolecular interaction networks. , 2003, Genome research.