Classifying Genomic Sequences by Sequence Feature Analysis

Traditional sequence analysis depends on sequence alignment. In this study, we instead analyzed functional regions of the human genome using sequence features, including word frequency, dinucleotide relative abundance, and base-base correlation. We analyzed human chromosome 22 and classified its upstream, exon, intron, downstream, and intergenic regions by applying principal component analysis and discriminant analysis to these features. The results show that the functional regions of the genome can be classified from sequence features using discriminant analysis.
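As an illustration of two of the features named above, the following is a minimal sketch of how word (k-mer) frequencies and dinucleotide relative abundance might be computed for a single sequence. It assumes the common single-strand definition rho_XY = f_XY / (f_X * f_Y); the abstract does not state the exact formulation used in the study, and the function names here are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer over the alphabet A, C, G, T.

    Words containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    return {w: (counts[w] / total if total else 0.0) for w in words}


def dinucleotide_relative_abundance(seq):
    """Assumed definition: rho_XY = f_XY / (f_X * f_Y).

    Returns 0.0 for a dinucleotide whose expected frequency is zero.
    """
    mono = kmer_frequencies(seq, 1)
    di = kmer_frequencies(seq, 2)
    rho = {}
    for xy, f_xy in di.items():
        expected = mono[xy[0]] * mono[xy[1]]
        rho[xy] = f_xy / expected if expected > 0 else 0.0
    return rho


if __name__ == "__main__":
    example = "ACGTACGT"  # toy sequence, not data from the study
    print(kmer_frequencies(example, 1))
    print(dinucleotide_relative_abundance(example)["AC"])
```

Feature vectors built this way for each region (for example, the 16 dinucleotide values) could then be passed to principal component analysis or a discriminant classifier, which is the kind of pipeline the abstract describes.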
