A complexity reduction algorithm for analysis and annotation of large genomic sequences.

DNA is a universal language that encodes the biological instructions for life. In higher organisms, genetic information is stored predominantly in an organized exon/intron structure. When a gene is expressed, the exons are spliced together to form the transcript used for protein synthesis. We have developed a complexity reduction algorithm for sequence analysis (CRASA) that enables direct alignment of cDNA sequences to the genome. The method uses a progressive, hierarchically ordered data structure to support a fast and efficient search mechanism. Using the EST database, the CRASA implementation was tested on previously annotated genomic sequences in two benchmark data sets and compared with 15 annotation programs (10 ab initio and 5 homology-based). Layered noise filters reduced the complexity of the CRASA-matched data exponentially. In the benchmark tests, CRASA annotation excelled in both sensitivity and specificity. When CRASA was applied to human Chromosomes 21 and 22, an additional 83 potential genes were identified. With its large-scale processing capability, CRASA can serve as a robust, highly accurate tool for genome annotation by matching EST sequences precisely to genomic sequences.
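
The abstract does not give CRASA's data structure or filters in detail, so the following is only a minimal sketch of the general idea of aligning a spliced cDNA directly to genomic sequence. It assumes a plain k-mer index, exact seed matches, and a single noise filter that discards short runs of seeds on a diagonal; it is not the CRASA implementation. The function names (`build_kmer_index`, `seed_hits`, `exon_blocks`), the parameter `min_seeds`, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map each k-mer to the list of genome positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def seed_hits(cdna, index, k):
    """Return (cdna_pos, genome_pos) for every exact k-mer match."""
    hits = []
    for j in range(len(cdna) - k + 1):
        for i in index.get(cdna[j:j + k], ()):
            hits.append((j, i))
    return hits


def exon_blocks(hits, k, min_seeds=3):
    """Merge consecutive seeds on the same diagonal into blocks.

    A diagonal is genome_pos - cdna_pos. Runs with fewer than
    min_seeds consecutive seeds are dropped as noise (an assumed,
    simplified stand-in for a layered filter). Each block is
    (cdna_start, cdna_end, genome_start, genome_end), end-exclusive.
    """
    by_diag = defaultdict(list)
    for j, i in hits:
        by_diag[i - j].append(j)

    blocks = []
    for diag, positions in by_diag.items():
        positions.sort()
        run_start = prev = positions[0]
        # The trailing None is a sentinel that flushes the last run.
        for j in positions[1:] + [None]:
            if j is not None and j == prev + 1:
                prev = j
                continue
            if prev - run_start + 1 >= min_seeds:
                blocks.append((run_start, prev + k,
                               run_start + diag, prev + k + diag))
            if j is not None:
                run_start = prev = j
    return sorted(blocks)


# Toy data: two exons separated by one intron (made-up sequences).
exon1 = "ATGGCCAAGT"
intron = "GTAAGTTTTTTTTCAG"
exon2 = "CCTGAAGCTTGA"
genome = exon1 + intron + exon2
cdna = exon1 + exon2  # spliced transcript

k = 4
index = build_kmer_index(genome, k)
blocks = exon_blocks(seed_hits(cdna, index, k), k, min_seeds=3)
for block in blocks:
    print(block)
```

In this toy case the two surviving blocks correspond to the two exons, and the gap between their genomic coordinates is the intron. A single stray seed (the k-mer `AAGT` also occurs in the intron) is removed by the run-length filter, which illustrates on a very small scale how filtering reduces the volume of matched data.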
