An in-silico method for prediction of polyadenylation signals in human sequences.

This paper presents a machine learning method for predicting polyadenylation signals (PASes) in human DNA and mRNA sequences by analysing the features around them. The method consists of three sequential feature-manipulation steps: generation, selection and integration. In the first step, new features are generated from k-gram nucleotide or amino acid patterns. In the second step, the most important features are selected by an entropy-based algorithm. In the third step, support vector machines are used to distinguish true PASes from a large number of candidates. Our study shows that true PASes in DNA and in mRNA sequences are characterized by different features, and that both upstream and downstream sequence elements are important for recognizing PASes in DNA sequences. We tested our method on several public data sets as well as on data sets we extracted ourselves. In most cases, we achieved better validation results than those previously reported on the same data sets. The important motifs we observed are highly consistent with those reported in the literature.
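
The first two steps described above (k-gram feature generation and entropy-based feature selection) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the use of binary presence/absence features, and plain information gain as the entropy-based criterion are not taken from the paper, which does not specify these details in the abstract. The support vector machine step is omitted to keep the sketch dependency-free.

```python
from collections import Counter
from itertools import product
from math import log2


def kgram_counts(seq, k):
    """Count every overlapping k-gram in a nucleotide sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(present, labels):
    """Entropy reduction from splitting samples on a binary feature."""
    with_f = [y for p, y in zip(present, labels) if p]
    without_f = [y for p, y in zip(present, labels) if not p]
    n = len(labels)
    remainder = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - remainder


def rank_kgrams(seqs, labels, k):
    """Rank all DNA k-grams by the information gain of their presence.

    Hypothetical stand-in for the paper's entropy-based selection step.
    """
    scores = {}
    for gram in ("".join(p) for p in product("ACGT", repeat=k)):
        present = [gram in s.upper() for s in seqs]
        scores[gram] = information_gain(present, labels)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy sequences invented for this example; 1 = true PAS, 0 = false.
    toy_seqs = ["GGAATAAACC", "TTAATAAAGG", "GGCCGGCCTT", "TTGGCCTTGG"]
    toy_labels = [1, 1, 0, 0]
    for gram, gain in rank_kgrams(toy_seqs, toy_labels, 3)[:5]:
        print(gram, round(gain, 3))
```

In a fuller pipeline, the top-ranked k-grams would form the feature vector passed to a support vector machine classifier; which kernel and parameters to use is left open here.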
