Classification of Sequences with Deep Artificial Neural Networks: Representation and Architectural Issues

DNA sequences are the basic data type processed in most biological data analyses. A key component of such analyses is sequence classification, a methodology widely used on sequential data of many kinds. Applying it to DNA sequences, however, requires a suitable representation of those sequences, and choosing one remains an open research problem. Machine Learning (ML) methodologies have contributed fundamentally to addressing this problem, and Deep Neural Network (DNN) models have recently shown strongly encouraging results. In this chapter, we deal with specific classification problems arising in two biological scenarios: (A) metagenomics and (B) chromatin organization. In both, DNA sequences are the input data for the classification methodologies. In particular, we study and test the efficacy of (1) different DNA sequence representations and (2) several Deep Learning (DL) architectures that process sequences to solve the related supervised classification problems. Although these architectures were developed for specific classification tasks, we think they could serve as a starting point for developing other DNN models that process the same kind of input.
