Application of latent semantic indexing to evaluate the similarity of sets of sequences without multiple alignments character-by-character.

Most molecular analyses, including phylogenetic inference, are based on sequence alignments. We present an algorithm that estimates relatedness between biomolecules without requiring sequence alignment. It uses a protein frequency matrix reduced by singular value decomposition (SVD) within a latent semantic indexing information retrieval system. Two databases were used: one with 832 proteins from 13 mitochondrial gene families and another with 1000 sequences from nine types of proteins retrieved from GenBank. First, 208 sequences from the first database and 200 from the second were randomly selected, and each pair of sequences was compared using edit distance and the corresponding cosine and Euclidean distance in the SVD-reduced space. The correlation between cosine and edit distance was -0.32 (P < 0.01), and the correlation between Euclidean distance and edit distance was +0.70 (P < 0.01). To check the ability of SVD to classify sequences according to their categories, we used a sample of 202 sequences from the 13 gene families as queries (test set), and the remaining 630 proteins were used to generate the frequency matrix (training set). The classification algorithm applies a voting scheme based on the five sequences most similar to each query. With a 3-peptide frequency matrix, all 202 queries were correctly classified (accuracy = 100%). The approach is attractive because sequence alignments are neither generated nor required. To achieve results similar to those obtained with edit distance analysis, we recommend that Euclidean distance be used as the similarity measure for protein sequences in latent semantic indexing methods.
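
The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`kmer_counts`, `frequency_matrix`, `fit_lsi`, `fold_in`, `classify`), the use of raw overlapping 3-peptide counts without weighting, the standard LSI fold-in projection, the rank of 2, and the toy sequences and labels are all hypothetical choices made for the example. The toy data uses three neighbours because it has only six training sequences; the abstract describes voting over five.

```python
from collections import Counter

import numpy as np


def kmer_counts(seq, k=3):
    """Count overlapping k-peptides in one protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequency_matrix(seqs, vocab, k=3):
    """Rows are k-peptides, columns are sequences (term-by-document layout)."""
    index = {kmer: row for row, kmer in enumerate(vocab)}
    mat = np.zeros((len(vocab), len(seqs)))
    for col, seq in enumerate(seqs):
        for kmer, n in kmer_counts(seq, k).items():
            if kmer in index:  # k-peptides outside the vocabulary are ignored
                mat[index[kmer], col] = n
    return mat


def fit_lsi(train_matrix, rank):
    """Truncated SVD: keep the `rank` largest singular triplets.

    Assumes the kept singular values are non-zero.
    """
    u, s, vt = np.linalg.svd(train_matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank].T  # last item: one row per sequence


def fold_in(query_matrix, u_r, s_r):
    """Project query columns into the reduced space: q^T U_r S_r^{-1}."""
    return (query_matrix.T @ u_r) / s_r


def classify(query_coords, train_coords, train_labels,
             n_neighbors=5, metric="euclidean"):
    """Label each query by majority vote among its nearest training sequences."""
    predictions = []
    for q in query_coords:
        if metric == "euclidean":
            dist = np.linalg.norm(train_coords - q, axis=1)
        else:  # cosine distance = 1 - cosine similarity
            norms = np.linalg.norm(train_coords, axis=1) * np.linalg.norm(q)
            dist = 1.0 - (train_coords @ q) / norms
        nearest = np.argsort(dist)[:n_neighbors]
        votes = Counter(train_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Made-up toy sequences and family labels, for illustration only.
train_seqs = ["MAKLAKLAKL", "AKLAKLMAKL", "MAKLAKLAKLAKL",
              "GPSGPSGPSG", "PSGPSGPSGP", "GPSGPSGPSGPS"]
train_labels = ["a", "a", "a", "b", "b", "b"]
query_seqs = ["AKLAKLAKLM", "SGPSGPSGPS"]

# Vocabulary: every 3-peptide observed in the training set.
vocab = sorted({kmer for seq in train_seqs for kmer in kmer_counts(seq)})

u_r, s_r, train_coords = fit_lsi(frequency_matrix(train_seqs, vocab), rank=2)
query_coords = fold_in(frequency_matrix(query_seqs, vocab), u_r, s_r)

print(classify(query_coords, train_coords, train_labels, n_neighbors=3))
```

Passing `metric="cosine"` switches the neighbour search to cosine distance, which makes it straightforward to compare the two measures discussed in the abstract on the same reduced coordinates.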
