Splice site identification by idlBNs

MOTIVATION: Computational identification of functional sites in nucleotide sequences is at the core of many algorithms for the analysis of genomic data. This identification relies on statistical parameters estimated from a training set. Because the number of parameters is often very large, it is difficult to obtain consistent estimators. To simplify the estimation problem, one imposes independence assumptions between the nucleotides along the site. However, such assumptions can place a lower bound on the estimation error that can be achieved.

RESULTS: In this paper, we introduce a method, novel in the context of identifying functional sites, that finds a reasonable set of independence assumptions among the nucleotides that is supported by the data, and uses it to identify sites by their likelihood ratio. More importantly, in many practical situations its performance improves as the training sample size increases. We apply the method to the identification of splice sites, and further evaluate its effect within the context of exon and gene prediction.
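To make the likelihood-ratio idea concrete, the following is a minimal sketch of the simplest case only: every position is assumed independent of the others (a position weight matrix), which is the kind of strong independence assumption the abstract says can limit accuracy. It is not the authors' method, which selects independence assumptions from the data. The function names, the pseudocount, and the toy sequences are all illustrative assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from equal-length sequences.

    Hypothetical helper: a pseudocount is added so no probability is zero.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    model = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sites)
        model.append({nt: (counts[nt] + pseudocount) / total for nt in ALPHABET})
    return model


def log_likelihood_ratio(seq, site_model, background_model):
    """Sum per-position log-ratios; valid only under full positional independence."""
    return sum(
        math.log(site_model[i][nt] / background_model[i][nt])
        for i, nt in enumerate(seq)
    )


# Toy, made-up training data (assumption: not taken from the paper).
true_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
false_sites = ["GTCCTA", "GTTTCC", "GTACCA", "GTCTTG"]

site_model = position_frequencies(true_sites)
background_model = position_frequencies(false_sites)

# A positive score favours the site model, a negative one the background model.
score_true = log_likelihood_ratio("GTAAGT", site_model, background_model)
score_false = log_likelihood_ratio("GTCCTA", site_model, background_model)
```

A candidate is called a site when its score exceeds a chosen threshold. Relaxing the independence assumption would replace the per-position probabilities with conditional probabilities given the positions each nucleotide depends on.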
