ECPF: an efficient algorithm for expanding clustered protein families

With the rapid development of gene sequencing technology, the number of known protein sequences has grown explosively. How to handle such a huge number of protein sequences has become a serious concern in the research field. An effective solution is to cluster homologous sequences into separate protein families. Proteins that belong to the same protein family share similar structure and/or gene function. Known proteins therefore provide valuable evidence for characterising unknown proteins. We present an efficient and effective algorithm called Expanding Clustered Protein Families (ECPF), which optimises already clustered protein sequences. The results show that ECPF is capable of discovering unknown connections between storing space and families in large-scale databases at an acceptable computational cost. ECPF successfully expands the protein sequence network and creates a more practical protein sequence topology for promoting biological research.
