Perspectives in Computational Genome Analysis

DNA segments that together cover a genome may be collected into a genomic dictionary of specific words. These words may be annotated either with biological information (the functional role they may play in regulatory mechanisms) or with numerical information (such as position in the genome, total number of occurrences, occurrences lying inside or outside genic sequences, CpG content, and more sophisticated informational indexes from text analysis). This chapter reviews two analogous and complementary dictionary-based approaches to genome analysis. We sketch some of the relevant knowledge about the (human) genome, in terms of the structure and functional role of its parts, and present an informational view based on a mathematical analysis of k-mer dictionaries, with the aim of opening the way to the formulation of a model. We give basic notions about genomic regulatory activity, whose underlying mechanisms of information exchange are far from understood. The chapter concludes with a description of an initial attempt at computational modeling of genomes, seen as a new language to be deciphered.
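
As a minimal sketch of what a numerically annotated k-mer dictionary might look like, the following Python fragment maps every k-mer of a sequence to its positions, its number of occurrences, and a simple CpG count. The function name `kmer_dictionary`, the annotation keys, and the choice to measure CpG content as the number of `CG` dinucleotides inside the word are illustrative assumptions, not the chapter's own definitions.

```python
from collections import defaultdict


def kmer_dictionary(genome: str, k: int) -> dict:
    """Map each k-mer (word) of `genome` to simple numerical annotations.

    Hypothetical helper for illustration only; the annotation keys are assumptions.
    """
    positions = defaultdict(list)
    # Slide a window of length k over the sequence, recording 0-based start positions.
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {
        word: {
            "positions": pos,          # where the word occurs in the genome
            "occurrences": len(pos),   # total number of occurrences
            "cpg": word.count("CG"),   # assumed proxy for CpG content
        }
        for word, pos in positions.items()
    }


# Toy sequence, chosen only to show the shape of the result.
d = kmer_dictionary("ACGTACGCG", 3)
print(d["ACG"])
```

On a real genome, a dictionary of this kind would normally be built on a compressed index rather than a plain hash map, and biological annotations (for example, whether an occurrence falls inside a genic sequence) would be added from external annotation data.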
