Predictive modeling of plant messenger RNA polyadenylation sites

Background: One of the essential processing events during pre-mRNA maturation is the post-transcriptional addition of a polyadenine [poly(A)] tail. The 3'-end poly(A) tract protects mRNA from unregulated degradation and signals the integrity of the mRNA through recognition by the mRNA export and translation machinery. The position of a poly(A) site is predetermined by signals in the pre-mRNA sequence that are recognized by a complex of polyadenylation factors. These signals are generally tripartite sequence patterns around the cleavage site, which serves as the future poly(A) site. In plants, there is little sequence conservation among these signal elements, which makes it difficult to develop an accurate algorithm to predict the poly(A) site of a given gene. We attempted to solve this problem.

Results: Based on our current working model and the nucleotide sequence distribution profiles of the poly(A) signals and of the regions around poly(A) sites in Arabidopsis, we have devised a Generalized Hidden Markov Model based algorithm to predict potential poly(A) sites. Tests on several datasets demonstrated the high specificity and sensitivity of the algorithm; at the best combinations, both reach 97%. The accuracy of the program, called poly(A) site sleuth or PASS, has been demonstrated by the prediction of many validated poly(A) sites. PASS also predicted the changes in poly(A) site efficiency in poly(A) signal mutants that were constructed and characterized by traditional genetic experiments. The efficacy of PASS was further demonstrated by predicting poly(A) sites within long genomic sequences.

Conclusion: Based on the features of plant poly(A) signals, a computational model was built to effectively predict the poly(A) sites in Arabidopsis genes. The algorithm will be useful in gene annotation because a poly(A) site signifies the end of the transcript. It can also be used to predict alternative poly(A) sites in known genes, and will be useful in the design of transgenes for crop genetic engineering by predicting and eliminating undesirable poly(A) sites.
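
The sensitivity and specificity figures quoted in the Results can be illustrated with a short sketch. It assumes the common definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the abstract does not state which definitions were used, and the function name, the score threshold, and the toy data below are hypothetical rather than taken from PASS.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when a position is called a
    poly(A) site if its score is >= threshold.

    labels: iterable of bool, True for a real poly(A) site (assumed ground truth)
    scores: iterable of float, hypothetical per-position predictor scores
    """
    tp = fn = tn = fp = 0
    for is_site, score in zip(labels, scores):
        called = score >= threshold
        if is_site and called:
            tp += 1          # real site, predicted as a site
        elif is_site:
            fn += 1          # real site, missed
        elif called:
            fp += 1          # non-site, wrongly predicted as a site
        else:
            tn += 1          # non-site, correctly rejected
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data (made up for illustration only).
labels = [True, True, True, False, False, False, False]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4]
sn, sp = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Raising the threshold generally trades sensitivity for specificity, which is one plausible reading of why the abstract reports the values reached "at the best combinations".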
