Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification

The classification of DNA sequences is a key research area in bioinformatics, as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are used for the task of DNA classification. Furthermore, we introduce a novel feature extraction method based on the Levenshtein distance and randomly generated DNA sub-sequences to compute information-rich features from the DNA sequences. We also use an existing feature extraction method based on 3-grams to represent amino acids, and we combine both feature extraction methods with a multitude of machine learning algorithms. Four data sets, each concerning a viral disease (Covid-19, AIDS, Influenza, or Hepatitis C), are used to evaluate the different approaches. The experiments show that all methods obtain high accuracies on these data sets. In general, the domain-specific 3-gram feature extraction method leads to the best results, while the newly proposed technique outperforms all other methods on the smallest data set (Covid-19).
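The two feature extraction ideas named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the number and length of the random sub-sequences, the fixed seed, and the use of an overlapping sliding window for the 3-grams are all assumptions made for illustration. The abstract only states that one method uses the Levenshtein distance to randomly generated DNA sub-sequences and the other uses 3-grams.

```python
import random
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def random_subsequences(n: int, length: int, seed: int = 0) -> list[str]:
    """Hypothetical helper: draw n random strings over the DNA alphabet."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]


def levenshtein_features(seq: str, references: list[str]) -> list[int]:
    """One feature per reference: the edit distance from seq to that reference."""
    return [levenshtein(seq, ref) for ref in references]


# All 64 possible 3-grams over A, C, G, T, in a fixed order.
TRIGRAMS = ["".join(p) for p in product("ACGT", repeat=3)]


def trigram_counts(seq: str) -> list[int]:
    """Count overlapping 3-grams (assumption: sliding window with stride 1)."""
    counts = dict.fromkeys(TRIGRAMS, 0)
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows containing symbols other than A, C, G, T
            counts[tri] += 1
    return [counts[t] for t in TRIGRAMS]


if __name__ == "__main__":
    refs = random_subsequences(n=5, length=8)
    sample = "ACGTACGTTAGC"
    print(levenshtein_features(sample, refs))  # 5 distance-based features
    print(sum(trigram_counts(sample)))         # number of 3-gram windows counted
```

Either feature vector could then be passed to a standard classifier. Sharing the same set of random reference sub-sequences across all input sequences keeps the Levenshtein features comparable between samples.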
