Classification of Genomic Sequences via Wavelet Variance and a Self-Organizing Map with an Application to Mitochondrial DNA

We present a new methodology for discriminating genomic symbolic sequences that combines wavelet analysis with a self-organizing map algorithm. Wavelets are used to extract the variation across scales in the oligonucleotide patterns of a sequence. This variation is quantified by the estimated wavelet variance, which yields a feature vector. Feature vectors obtained from many genomic sequences, possibly of different lengths, are then classified with a nonparametric self-organizing map scheme. When applied to nearly 200 entire mitochondrial DNA sequences, or to their fragments, the method predicts species taxonomic group membership very well and allows the results to be visualized. When only thousands of nucleotides are available, wavelet-based feature vectors of short oligonucleotide patterns discriminate more efficiently than frequency-based feature vectors of long patterns. This new data analysis strategy could be extended to numeric genomic data. The routines needed to perform the computations are readily available in two packages for the R software environment.
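As an illustration of the feature-extraction step, the following is a minimal sketch in Python with NumPy rather than the R packages mentioned above. It rests on several assumptions that the abstract does not specify: each oligonucleotide pattern is encoded as a 0/1 indicator sequence marking where the pattern starts, the Haar filter is used, and the wavelet variance at each level is estimated as the mean squared non-boundary MODWT coefficient. The function names `pattern_indicator`, `haar_wavelet_variance`, and `feature_vector` are hypothetical.

```python
import numpy as np


def pattern_indicator(seq, pattern):
    """Return a 0/1 series marking the positions where `pattern` starts in `seq`.

    Assumed encoding: the abstract does not say how patterns are made numeric.
    """
    seq = seq.upper()
    k = len(pattern)
    return np.array(
        [1.0 if seq[i:i + k] == pattern else 0.0 for i in range(len(seq) - k + 1)]
    )


def haar_wavelet_variance(x, n_levels):
    """Estimate the Haar wavelet variance of `x` at levels 1..n_levels.

    The level-j Haar MODWT wavelet filter has 2**j taps: the first half equal
    to 2**-j and the second half equal to -(2**-j). Only coefficients that do
    not touch the series boundary are used (mode="valid").
    """
    x = np.asarray(x, dtype=float)
    variances = []
    for j in range(1, n_levels + 1):
        half = 2 ** (j - 1)
        h = np.concatenate([np.full(half, 2.0 ** -j), np.full(half, -(2.0 ** -j))])
        if len(x) < len(h):
            raise ValueError(f"series too short for level {j}")
        w = np.convolve(x, h, mode="valid")  # non-boundary coefficients
        variances.append(np.mean(w ** 2))
    return np.array(variances)


def feature_vector(seq, patterns, n_levels):
    """Concatenate the per-level wavelet variances of every pattern."""
    return np.concatenate(
        [haar_wavelet_variance(pattern_indicator(seq, p), n_levels) for p in patterns]
    )


if __name__ == "__main__":
    # Toy input, not real mitochondrial data.
    toy = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA"
    print(feature_vector(toy, ["AC", "GT"], n_levels=3))
```

Because the vector length depends only on the number of patterns and levels, sequences of different lengths map to vectors of the same dimension, which is what allows them to be passed together to a self-organizing map in a subsequent step.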
