Paper Information - Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification

The classification of DNA sequences is a key research area in bioinformatics, as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are applied to DNA classification. We also introduce a novel feature extraction method that uses the Levenshtein distance and randomly generated DNA sub-sequences to compute information-rich features from the DNA sequences. In addition, we use an existing feature extraction method based on 3-grams, which represent amino acids, and combine both feature extraction methods with a multitude of machine learning algorithms. Four datasets, covering the viral diseases Covid-19, AIDS, influenza, and hepatitis C, are used to evaluate the different approaches. The experiments show that all methods obtain high accuracies on all four DNA datasets. The domain-specific 3-gram feature extraction method generally leads to the best results, while the newly proposed technique outperforms all other methods on the Covid-19 dataset, which is the smallest.
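The two feature extraction ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the random sub-sequences are generated, how many are used, how long they are, or whether the 3-grams overlap, so everything below is an assumption made for illustration: the function names (`random_probes`, `levenshtein_features`, `trigram_features`), the probe count and length, the fixed seed, and the use of overlapping 3-gram counts are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only the four standard bases


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def random_probes(n_probes: int, length: int, seed: int = 0) -> list[str]:
    """Hypothetical helper: draw random DNA sub-sequences to compare against."""
    rng = random.Random(seed)
    return ["".join(rng.choice(ALPHABET) for _ in range(length))
            for _ in range(n_probes)]


def levenshtein_features(seq: str, probes: list[str]) -> list[int]:
    """One feature per probe: the edit distance between the sequence and that probe."""
    return [levenshtein(seq, p) for p in probes]


def trigram_features(seq: str) -> list[int]:
    """Counts of all 64 possible 3-grams, using overlapping windows (an assumption)."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    return [counts["".join(t)] for t in product(ALPHABET, repeat=3)]


if __name__ == "__main__":
    probes = random_probes(n_probes=5, length=8)
    sequence = "ATGCGTACGTTAGC"
    print(levenshtein_features(sequence, probes))  # 5 distance features
    print(sum(trigram_features(sequence)))         # number of 3-gram windows
```

Either feature vector has a fixed length regardless of the sequence length, which is what allows it to be passed to a standard classifier; the specific choice of classifier and any normalization are left open here.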
