A classification-based prediction model of messenger RNA polyadenylation sites.

Messenger RNA polyadenylation is one of the essential processing steps during eukaryotic gene expression. The site of polyadenylation [poly(A) site] marks the end of a transcript, which is also the end of a gene. A computational program able to recognize poly(A) sites would prove useful not only for finding gene ends during genome annotation, but also for predicting alternative poly(A) sites. Features that define poly(A) sites can now be extracted from poly(A) site datasets to build such predictive models. Using methods including K-gram patterns, the Z-curve, position-specific scoring matrices and a first-order inhomogeneous Markov sub-model, numerous features were generated and placed in an original feature space. To select the most useful features, attribute selection algorithms, such as information gain and entropy, were employed. A training model based on a Bayesian network was then built to determine a subset of the optimal features. Test models corresponding to the training models were built to predict poly(A) sites in Arabidopsis and rice. Thus, a prediction model, termed Poly(A) site classifier, or PAC, was constructed. The model is distinctive in its structure: each sub-model can be replaced or expanded, and feature generation, selection and classification are all independent processes. This modular design makes it easily adaptable to different species or datasets. The algorithm's high specificity and sensitivity were demonstrated by testing on several datasets; at the best combinations, both reached 95%. The software package may be used for genome annotation and for optimizing transgene structure.
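
The feature generation and selection steps named above can be illustrated with a minimal sketch. It is not the PAC implementation: the abstract gives no formulas, so the code assumes overlapping K-gram frequencies, the standard Z-curve endpoint normalized by sequence length, and information gain computed for a single threshold split. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def kgram_features(seq, k=2):
    """Relative frequency of every k-gram over ACGT (overlapping windows)."""
    seq = seq.upper().replace("U", "T")
    windows = max(len(seq) - k + 1, 0)
    counts = Counter(seq[i:i + k] for i in range(windows))
    total = max(windows, 1)  # avoid division by zero on very short input
    return {"".join(p): counts["".join(p)] / total
            for p in product("ACGT", repeat=k)}


def z_curve_features(seq):
    """Length-normalized Z-curve endpoint (assumed feature definition).

    x: purine vs pyrimidine, y: amino vs keto, z: weak vs strong H-bond.
    """
    seq = seq.upper().replace("U", "T")
    c = Counter(seq)
    a, cy, g, t = c["A"], c["C"], c["G"], c["T"]
    n = max(len(seq), 1)
    return (((a + g) - (cy + t)) / n,
            ((a + cy) - (g + t)) / n,
            ((a + t) - (g + cy)) / n)


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((m / n) * log2(m / n) for m in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting one numeric feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (left, right) if part)
    return entropy(labels) - conditional
```

In a pipeline of this shape, each candidate feature would be scored with `information_gain` against the poly(A)/non-poly(A) labels, and only the top-ranked features would be passed on to the classifier. Because the generators and the scoring function share no state, any one of them can be swapped out independently, which mirrors the modular design described above.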
