Species Identification Using Part of DNA Sequence: Evidence from Machine Learning Algorithms

Species identification is one of the central problems in biological studies. Several methods have been proposed that identify species from whole DNA sequences. In this study, we present new insights into species identification using only part of the DNA sequence. Clustering k-Nearest Neighbor (K-C-NN) and Support Vector Machine (SVM) classifiers were used to test and evaluate improved statistical features extracted from the DNA sequences of four species (Aquifex aeolicus, Bacillus subtilis, Aeropyrum pernix, and Buchnera sp.). The results show that part of a DNA sequence can be used to identify species.
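
The general idea of classifying a DNA fragment from statistical features can be illustrated with a minimal sketch. It is not the method of the study: the abstract does not specify the improved features or the K-C-NN procedure, so the sketch assumes dinucleotide frequencies as the features and a plain k-nearest-neighbor vote as the classifier. The function names, the toy fragments, and the labels species_A and species_B are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k=2):
    """Return the normalized k-mer frequencies of seq in a fixed order.

    Assumption: k-mer composition stands in for the (unspecified)
    statistical features used in the study.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to query.

    train is a list of (feature_vector, label) pairs. This is plain
    k-NN with Euclidean distance, not the clustering variant (K-C-NN).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy fragments invented for illustration only (not real genomic data).
FRAGMENTS = [
    ("ATATTAATTATAAATT", "species_A"),
    ("TTAATATATTTAAATA", "species_A"),
    ("AATTATATAATTTATA", "species_A"),
    ("GCGCCGGCGGCCGCGC", "species_B"),
    ("CCGGCGCGCCGGGCGC", "species_B"),
    ("GGCCGCGGCGCCGCCG", "species_B"),
]
TRAIN = [(kmer_frequencies(seq), label) for seq, label in FRAGMENTS]

# Classify a short, previously unseen fragment.
print(knn_predict(TRAIN, kmer_frequencies("ATTATAATATTTAATA")))
```

Only a short fragment is passed to the classifier, which mirrors the claim that part of a sequence can carry enough signal. Whether that holds for real genomes depends on the features and fragment length, which this sketch does not evaluate.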
