Genetic Sequence Classification and its Application to Cross-Species Homology Detection

Although large-scale classification studies of genetic sequence data are in progress around the world, very few of them compare different classification approaches, such as unsupervised and supervised methods, against objective criteria such as classification accuracy and computational complexity. In this paper, we study these criteria for both unsupervised and supervised classification of a relatively large sequence data set. The unsupervised approach applies different sequence alignment algorithms (e.g., Smith-Waterman, FASTA and BLAST) and then clusters the sequences with the Maximin algorithm. The supervised approach uses a numeric encoding (relative frequencies of nucleotide tuples, reduced by principal component analysis) that is fed to a multilayer neural network trained with backpropagation. Classification experiments conducted on IBM-SP parallel computers show that FASTA with unsupervised Maximin clustering gives the best trade-off between accuracy and speed among all methods, with supervised neural networks as the second-best approach. Finally, the different classifiers are applied to the problem of cross-species homology detection.
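As an illustration of two building blocks named above, the following minimal sketch computes relative frequencies of nucleotide tuples and runs one common formulation of the maximin-distance clustering heuristic. It is not the paper's implementation: the function names, the tuple length `k=2`, the stopping parameter `fraction`, and the toy sequences are all assumptions made for illustration. The Euclidean distance between frequency vectors stands in for the alignment-based scores (Smith-Waterman, FASTA, BLAST) that the paper uses, and the PCA and neural network stages are omitted.

```python
from itertools import product
from math import dist


def tuple_frequencies(seq, k=2, alphabet="ACGT"):
    """Relative frequency of every length-k nucleotide tuple in seq."""
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    windows = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k].upper()
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
            windows += 1
    return [counts[t] / windows if windows else 0.0 for t in tuples]


def maximin_clusters(points, distance, fraction=0.5):
    """Maximin-distance clustering (one common formulation, assumed here).

    Starts from the first point as a center, then repeatedly promotes the
    point farthest from its nearest center, as long as that distance exceeds
    `fraction` times the mean distance between existing centers.
    Returns (indices of the centers, cluster label of every point).
    """
    centers = [0]
    while len(centers) < len(points):
        # Distance from each point to its nearest existing center.
        nearest = [min(distance(p, points[c]) for c in centers) for p in points]
        candidate = max(range(len(points)), key=nearest.__getitem__)
        if len(centers) == 1:
            threshold = 0.0  # accept a second center unless all points coincide
        else:
            pairs = [distance(points[a], points[b])
                     for i, a in enumerate(centers) for b in centers[i + 1:]]
            threshold = fraction * sum(pairs) / len(pairs)
        if nearest[candidate] <= threshold:
            break
        centers.append(candidate)
    # Assign every point to its nearest center.
    labels = [min(range(len(centers)),
                  key=lambda j: distance(p, points[centers[j]]))
              for p in points]
    return centers, labels


# Toy sequences (hypothetical, not from the paper's data set).
sequences = ["AAAAAA", "AAAAAT", "GCGCGC", "GCGCGG"]
vectors = [tuple_frequencies(s, k=2) for s in sequences]
centers, labels = maximin_clusters(vectors, dist)
print(centers, labels)  # [0, 2] [0, 0, 1, 1]
```

In this sketch the number of clusters is not fixed in advance; it emerges from the `fraction` threshold, which is why maximin-style methods suit data sets where the number of sequence families is unknown. Replacing `dist` with a function derived from pairwise alignment scores would bring the example closer to the unsupervised pipeline described in the abstract.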
