Subgrouping Automata: Automatic sequence subgrouping using phylogenetic tree-based optimum subgrouping algorithm

Subgrouping a given sequence set enables informative tasks such as the functional discrimination of sequence subsets and the functional inference of unknown sequences. Because the appropriate identity threshold for subgrouping varies with the sequence set, it is highly desirable to have a robust algorithm that automatically identifies an optimal identity threshold and generates subgroups for a given sequence set. To this end, an automatic sequence subgrouping method named 'Subgrouping Automata' was constructed. First, the tree analysis module analyzes the structure of the phylogenetic tree and enumerates all possible subgroups at each node. The sequence similarity analysis module then calculates the average sequence similarity of all subgroups at each node. The representative sequence generation module finds a representative sequence for each subgroup using profile analysis and self-scoring. Once average sequence similarities have been calculated for all nodes, 'Subgrouping Automata' searches for the node showing the statistically largest increase in sequence similarity, as measured by Student's t-value. The node with the maximum t-value, which marks the most significant difference in average sequence similarity between two adjacent nodes, is taken as the optimum subgrouping node in the phylogenetic tree. Further analysis showed that the optimum subgrouping node selected by 'Subgrouping Automata' prevents both under-subgrouping and over-subgrouping.
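The node-selection step can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: candidate nodes are visited in order from the root toward the leaves, each node is summarized by a list of the average similarities of its subgroups, and the t-value is a two-sample Student's t with pooled variance. The function names `pooled_t` and `optimum_node` and the sample values are hypothetical and are not taken from the published tool.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample Student's t (pooled variance) for mean(b) - mean(a).

    Assumes each sample has at least two values and the pooled
    variance is non-zero.
    """
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(sp2 * (1 / na + 1 / nb))


def optimum_node(similarities_by_node):
    """Return (index, t) of the node whose subgroup similarities rise most
    significantly relative to the preceding (adjacent) node."""
    best_index, best_t = None, float("-inf")
    for i in range(1, len(similarities_by_node)):
        t = pooled_t(similarities_by_node[i - 1], similarities_by_node[i])
        if t > best_t:
            best_index, best_t = i, t
    return best_index, best_t


# Hypothetical data: one list of subgroup average similarities per node,
# ordered from the root (few, loose subgroups) toward the leaves.
nodes = [
    [0.30, 0.34],
    [0.33, 0.36, 0.35],
    [0.70, 0.74, 0.72, 0.69],
    [0.73, 0.75, 0.74, 0.72, 0.76],
]

index, t_value = optimum_node(nodes)
print(index, round(t_value, 2))
```

With these illustrative values the sharp jump in similarity occurs between the second and third nodes, so the third node (index 2) is selected. Splitting earlier would leave dissimilar sequences grouped together (under-subgrouping), while splitting later adds subgroups without a significant gain in similarity (over-subgrouping).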
