Position-dependent motif characterization using non-negative matrix factorization

Motivation: Cis-acting regulatory elements are frequently constrained by both sequence content and positioning relative to a functional site, such as a splice or polyadenylation site. We describe an approach to regulatory motif analysis based on non-negative matrix factorization (NMF). Whereas existing pattern recognition algorithms focus primarily on sequence content, our method simultaneously characterizes both the positioning and the sequence content of putative motifs.

Results: Tests on artificially generated sequences show that NMF can faithfully reproduce both the positioning and the content of test motifs. We show how the variation of the residual sum of squares can be used to give a robust estimate of the number of motifs or patterns in a sequence set. Our analysis distinguishes multiple motifs with significant overlap in sequence content and/or positioning. Finally, we demonstrate the use of the NMF approach through characterization of biologically interesting datasets. Specifically, an analysis of mRNA 3′-processing (cleavage and polyadenylation) sites from a broad range of higher eukaryotes reveals a conserved core pattern of three elements.

Contact: joel.graber@jax.org

Supplementary information: Supplementary data are available at Bioinformatics online.
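The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the data matrix is built or which NMF variant is used, so the code below assumes a k-mer by position count matrix over aligned sequences and the standard multiplicative updates of Lee and Seung for the squared-error objective. All function names (kmer_position_matrix, nmf, rss, rss_by_rank) are hypothetical.

```python
from itertools import product

import numpy as np


def kmer_position_matrix(seqs, k=2, alphabet="ACGT"):
    """Count each k-mer (rows) at each start position (columns).

    Assumes the sequences are aligned on a common functional site;
    they are truncated to the shortest length. K-mers containing
    characters outside the alphabet are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    length = min(len(s) for s in seqs)
    V = np.zeros((len(kmers), length - k + 1))
    for s in seqs:
        for pos in range(length - k + 1):
            row = index.get(s[pos:pos + k])
            if row is not None:
                V[row, pos] += 1
    return V, kmers


def nmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor V ~ W @ H with W, H >= 0 using multiplicative updates.

    Columns of W describe sequence content (k-mer weights);
    rows of H describe positioning along the sequence.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def rss(V, W, H):
    """Residual sum of squares of the factorization."""
    return float(np.sum((V - W @ H) ** 2))


def rss_by_rank(V, ranks, **kwargs):
    """RSS for each candidate number of patterns (rank)."""
    return {r: rss(V, *nmf(V, r, **kwargs)) for r in ranks}
```

Under these assumptions, inspecting how the values returned by rss_by_rank change as the rank grows, and looking for the point where the decrease flattens, is one simple way to read off a candidate number of patterns; the criterion used in the paper may differ.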
