Large scale hierarchical clustering of protein sequences

Background

Searching a biological sequence database with a query sequence looking for homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines it is still virtually impossible to identify quickly and clearly a group of sequences that a given query sequence belongs to.

Results

We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences resulting in a hierarchical clustering which is being made available for querying and browsing at http://systers.molgen.mpg.de/.

Conclusions

Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
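The abstract does not spell out the graph-based procedure, so the following is only a generic illustration of how a two-level (superfamily, then family) grouping can be read off a sequence-similarity graph. It is not the authors' algorithm: it swaps in plain threshold-based single linkage (connected components at a loose and a strict cutoff). The function names, the thresholds, and the toy scores are all hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Sort components by their sorted member names for a stable order.
    return sorted((frozenset(g) for g in groups.values()), key=sorted)


def two_level_clustering(nodes, scores, loose, strict):
    """Return {superfamily: [families]} from pairwise similarity scores.

    Assumption: a higher score means more similar sequences. Superfamilies
    are components at the loose cutoff; families are components at the
    strict cutoff, computed inside each superfamily.
    """
    hierarchy = {}
    loose_edges = [pair for pair, s in scores.items() if s >= loose]
    for superfamily in connected_components(nodes, loose_edges):
        strict_edges = [
            pair for pair, s in scores.items()
            if s >= strict and pair[0] in superfamily and pair[1] in superfamily
        ]
        hierarchy[superfamily] = connected_components(
            sorted(superfamily), strict_edges
        )
    return hierarchy


# Toy, made-up similarity scores (not real alignment results).
nodes = ["a1", "a2", "b1", "b2", "c1"]
scores = {
    ("a1", "a2"): 0.9,
    ("b1", "b2"): 0.8,
    ("a2", "b1"): 0.4,
}
hierarchy = two_level_clustering(nodes, scores, loose=0.3, strict=0.7)
for superfamily, families in hierarchy.items():
    print(sorted(superfamily), "->", [sorted(f) for f in families])
```

Because every strict edge is also a loose edge, each family is guaranteed to sit inside exactly one superfamily, which is what makes the result a hierarchy. Fixed global cutoffs are the simplest possible choice here; the abstract indicates that the published method instead adapts to the topology of the data.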
