Lineage‐associated underrepresented permutations (LAUPs) of mammalian genomic sequences based on a Jellyfish‐based LAUPs analysis application (JBLA)

Motivation: This study addresses several important questions related to naturally underrepresented sequences: (i) Are there permutations of real genomic DNA sequences of a defined length (k‐mer) in a given lineage that do not actually exist or are underrepresented? (ii) If there are such sequences, what are their characteristics in terms of k‐mer length and base composition? (iii) Are they related to the CpG or TpA underrepresentation known for human sequences? We propose that the answers to these questions are of great significance for the study of sequence‐associated regulatory mechanisms, such as cytosine methylation and chromosomal structures, in physiological or pathological conditions such as cancer.

Results: We empirically defined sequences that were not included in any well‐known public databases as lineage‐associated underrepresented permutations (LAUPs). Then, we developed a Jellyfish‐based LAUPs analysis application (JBLA) to investigate LAUPs for 24 representative species. The present discoveries include: (i) The shortest LAUPs have lengths ranging from 10 to 14 and collectively constitute a low proportion of the genome. (ii) Common LAUPs show higher CG content over the analysed mammalian genome and possess distinct CG*CG motifs. (iii) Neither CpG‐containing LAUPs nor CpG island sequences are randomly structured and distributed over the genomes; some LAUPs and most CpG‐containing sequences exhibit an opposite trend within the same k and n variants. In addition, we demonstrate that the JBLA algorithm is more efficient than the original Jellyfish for computing LAUPs.

Availability and implementation: We developed a Jellyfish‐based LAUP analysis (JBLA) application by integrating Jellyfish (Marçais and Kingsford, 2011), MEME (Bailey, et al., 2009) and the NCBI genome database (Pruitt, et al., 2007) applications, which are listed as Supplementary Material.

Supplementary information: Supplementary data are available at Bioinformatics online.
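To make the notion of an absent k‐mer concrete, the following is a minimal Python sketch, not the JBLA implementation. It enumerates all 4^k permutations of length k and reports those never observed in a set of input sequences. The function names (`count_kmers`, `absent_kmers`, `shortest_absent_length`) and the `max_k` cut-off are hypothetical, only one strand is scanned (reverse complements are not merged), and windows containing ambiguous bases such as N are skipped. Exhaustive enumeration is only practical for small k; the lengths of 10 to 14 reported above are the regime in which a dedicated k‐mer counter such as Jellyfish would be used.

```python
from itertools import product

BASES = "ACGT"


def count_kmers(sequence, k):
    """Count every k-mer in one sequence, skipping windows with non-ACGT bases."""
    counts = {}
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(BASES):
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def absent_kmers(sequences, k):
    """Return the sorted k-mers (out of all 4**k) that occur in none of the sequences."""
    observed = set()
    for seq in sequences:
        observed.update(count_kmers(seq, k))
    all_kmers = ("".join(p) for p in product(BASES, repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in observed)


def shortest_absent_length(sequences, max_k=6):
    """Find the smallest k (up to the hypothetical cut-off max_k) with an absent k-mer."""
    for k in range(1, max_k + 1):
        missing = absent_kmers(sequences, k)
        if missing:
            return k, missing
    return None


if __name__ == "__main__":
    toy_genome = ["ACGTACGT"]  # toy input, not real genomic data
    k, missing = shortest_absent_length(toy_genome)
    print(k, len(missing))
```

On a real genome the observed set would come from a k‐mer counter's output rather than from an in-memory scan, and the candidate LAUPs would then be checked against the public databases, as the abstract describes.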

[1]  Robert Giegerich,et al.  Efficient computation of absent words in genomic sequences , 2008, BMC Bioinformatics.

[2]  Eric T Kool,et al.  Hydrophobic, Non-Hydrogen-Bonding Bases and Base Pairs in DNA. , 1995, Journal of the American Chemical Society.

[3]  Jun Yu,et al.  On the nature of human housekeeping genes. , 2008, Trends in genetics : TIG.

[4]  Kimberly Glass,et al.  All and only CpG containing sequences are enriched in promoters abundantly bound by RNA polymerase II in multiple tissues , 2008, BMC Genomics.

[5]  Steve Horvath,et al.  Repetitive sequence environment distinguishes housekeeping genes. , 2007, Gene.

[6]  Bin Hu,et al.  Investigation of mechanism of bone regeneration in a porous biodegradable calcium phosphate (CaP) scaffold by a combination of a multi-scale agent-based model and experimental optimization/validation. , 2016, Nanoscale.

[7]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[8]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[9]  Songnian Hu,et al.  A novel DNA sequence periodicity decodes nucleosome positioning , 2008, Nucleic acids research.

[10]  M. Rief,et al.  Mechanical stability of single DNA molecules. , 2000, Biophysical journal.

[11]  Leng Han,et al.  CpG island density and its correlations with genomic features in mammalian genomes , 2008, Genome Biology.

[12]  M. Frommer,et al.  CpG islands in vertebrate genomes. , 1987, Journal of molecular biology.

[13]  U. Bockelmann,et al.  Mechanical separation of the complementary strands of DNA. , 1997, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Na Li,et al.  EZH2-, CHD4-, and IDH-linked epigenetic perturbation and its association with survival in glioma patients , 2017, Journal of molecular cell biology.

[15]  Carl Kingsford,et al.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers , 2011, Bioinform..

[16]  L. Machattie,et al.  Limited permutations of the nucleotide sequence in bacteriophage T1 DNA. , 1976, Journal of molecular biology.

[17]  B F Ouellette,et al.  The GenBank sequence database. , 1998, Methods of biochemical analysis.

[18]  T. D. Schneider,et al.  Sequence logos: a new way to display consensus sequences. , 1990, Nucleic acids research.

[19]  Manasi Gadkari,et al.  Developmentally Programmed 3′ CpG Island Methylation Confers Tissue- and Cell-Type-Specific Transcriptional Activation , 2013, Molecular and Cellular Biology.

[20]  Hideaki Sugawara,et al.  DNA Data Bank of Japan (DDBJ) for genome scale research in life science , 2002, Nucleic Acids Res..

[21]  Liqing Zhang,et al.  Housekeeping and tissue-specific genes differ in simple sequence repeats in the 5'-UTR region. , 2008, Gene.

[22]  Liquan Xiao,et al.  On the Shoulders of Giants: Incremental Influence Maximization in Evolving Social Networks , 2015, Complex..

[23]  Badong Chen,et al.  Building Up a Robust Risk Mathematical Platform to Predict Colorectal Cancer , 2017, Complex..

[24]  Robert Riehn,et al.  CpG and methylation-dependent DNA binding and dynamics of the methylcytosine binding domain 2 protein at the single-molecule level , 2017, Nucleic acids research.

[25]  P. D’haeseleer What are DNA sequence motifs? , 2006, Nature Biotechnology.

[26]  T. Yomo,et al.  Concordant evolution of coding and noncoding regions of DNA made possible by the universal rule of TA/CG deficiency-TG/CT excess. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[27]  Tian-biao Zhang,et al.  Determination of Base Binding Strength and Base Stacking Interaction of DNA Duplex Using Atomic Force Microscope , 2015, Scientific Reports.

[28]  L. Mularoni,et al.  Housekeeping genes tend to show reduced upstream sequence conservation , 2007, Genome Biology.

[29]  Ari M. P. Koskinen,et al.  Asymmetric synthesis of natural products , 1993 .

[30]  Xiaobo Zhou,et al.  Novel 3D GPU based numerical parallel diffusion algorithms in cylindrical coordinates for health care simulation , 2015, Math. Comput. Simul..

[31]  Xiaobo Zhou,et al.  Employing graphics processing unit technology, alternating direction implicit method and domain decomposition to speed up the numerical diffusion solver for the biomedical engineering research , 2011 .

[32]  Mikael Bodén,et al.  MEME Suite: tools for motif discovery and searching , 2009, Nucleic Acids Res..

[33]  J. Biro,et al.  Frequent occurrence of short complementary sequences in nucleic acids. , 1986, Biochemical and biophysical research communications.

[34]  Timothy L. Andersen,et al.  Absent Sequences: Nullomers and Primes , 2006, Pacific Symposium on Biocomputing.

[35]  E. E. Max,et al.  CG dinucleotide clusters in MHC genes and in 5' demethylated genes. , 1984, Nucleic acids research.

[36]  Ruiting Lan,et al.  Evolutionary Relationships of Pathogenic Clones of Vibrio cholerae by Sequence Analysis of Four Housekeeping Genes , 1999, Infection and Immunity.

[37]  S Brunak,et al.  Structural analysis of DNA sequence: evidence for lateral gene transfer in Thermotoga maritima. , 2000, Nucleic acids research.

[38]  T. Grisar,et al.  Housekeeping genes as internal standards: use and limits. , 1999, Journal of biotechnology.

[39]  Daiya Takai,et al.  Comprehensive analysis of CpG islands in human chromosomes 21 and 22 , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[40]  Yi-ping Wang,et al.  Cloning, expression, and purification of lipoprotein-associated phospholipase A2 in Pichia pastoris , 2006, Molecular biotechnology.

[41]  Tatiana A. Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[42]  Daniele Santoni,et al.  Nullomers and High Order Nullomers in Genomic Sequences , 2016, PloS one.

[43]  Albert Jeltsch,et al.  Circular Permutations in the Molecular Evolution of DNA Methyltransferases , 1999, Journal of Molecular Evolution.

[44]  Xiaobo Zhou,et al.  Characterization of p38 MAPK isoforms for drug resistance study using systems biology approach , 2014, Bioinform..

[45]  Janusz M Bujnicki,et al.  Sequence permutations in the molecular evolution of DNA methyltransferases , 2002, BMC Evolutionary Biology.

[46]  G. Ferenczy,et al.  Optical Trapping Nanometry of Hypermethylated CPG-Island DNA. , 2017, Biophysical journal.

[47]  Le Zhang,et al.  Developing an Agent-Based Drug Model to Investigate the Synergistic Effects of Drug Combinations , 2017, Molecules.

[48]  Rodrigo Lopez,et al.  The EMBL Nucleotide Sequence Database , 1999, Nucleic Acids Res..