CLASSEQ: Classification of Sequences via Comparative Analysis of Multiple Genomes

CLASSEQ is a Web-based system for analyzing and comparing uncharacterized protein sequences against multiple genomes. The user's sequences are combined with the protein sequences of the user-specified genomes and then clustered with BAG, our in-house fast clustering algorithm. A database of pre-computed genome-to-genome pairwise comparisons, PCDB, makes the service fast enough to be offered on the Web even though an analysis typically involves tens of thousands of sequences. Clusters that contain the user's input sequences can be characterized further by domain search, multiple sequence alignment, phylogenetic tree analysis, and gene neighborhood analysis. This Web service is a useful resource for characterizing proteins of unknown function via a comparative genomics approach. CLASSEQ is available at http://platcom.org/CLASSEQ.
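
The workflow described above (pool the user sequences with the genome proteins, cluster the pool, keep the clusters that contain a user sequence) can be illustrated with a minimal Python sketch. It is not the BAG algorithm: plain connected-component clustering with union-find is used here as a stand-in, and all sequence identifiers, the `precomputed_pairs` list (standing in for PCDB lookups), and the function names are hypothetical.

```python
from collections import defaultdict


def cluster_sequences(seq_ids, similar_pairs):
    """Group sequences into connected components (a stand-in for BAG)."""
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every pair reported as similar.
    for a, b in similar_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(set)
    for s in seq_ids:
        groups[find(s)].add(s)
    return list(groups.values())


def clusters_with_user_sequences(clusters, user_ids):
    """Keep only the clusters that contain at least one user sequence."""
    user_ids = set(user_ids)
    return [c for c in clusters if c & user_ids]


# Hypothetical data: genome-vs-genome pairs are assumed to come from a
# pre-computed store; only user-vs-genome pairs are computed at query time.
genome_ids = ["gA_p1", "gA_p2", "gB_p1", "gB_p2"]
user_ids = ["query1", "query2"]
precomputed_pairs = [("gA_p1", "gB_p1")]
new_pairs = [("query1", "gA_p1")]

clusters = cluster_sequences(genome_ids + user_ids, precomputed_pairs + new_pairs)
hits = clusters_with_user_sequences(clusters, user_ids)
```

In this toy input, `query1` lands in a cluster with two genome proteins, while `query2` has no similar partner and remains a singleton. Only the user-versus-genome pairs need to be computed per request, which is the reason a pre-computed pairwise store keeps such a service responsive.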
