A Bayesian Framework for Eucaryotic Promoter Recognition

Eucaryotic promoter recognition plays an important role in locating transcription start sites and in understanding differential gene expression regulated at the transcription level. Over the past decades, two major strategies for recognizing eucaryotic promoters have emerged, distinguished by their choice of features: signal-based methods and content-based methods. Signal-based methods mainly use salient biological signals as features, including transcription factor binding sites (e.g., the TATA-box or CCAAT-box), the conserved initiator region, and mammalian CpG islands. Although they describe the fine details of core promoters well, signal-based methods generally suffer from unacceptably high false positive rates. Content-based methods, on the other hand, focus on the genetic context of promoters (e.g., IUPAC words and their statistics) rather than on their exact position, and can reduce false positive rates significantly while maintaining relatively high sensitivity. However, a theoretically well-founded framework is needed to build classifiers based on the contextual information of promoters. We have developed a Bayesian framework for eucaryotic promoter recognition. Bayesian decision theory is one of the fundamental theories of pattern recognition and has been applied successfully to many pattern recognition problems. For the features, we use IUPAC words to describe the contextual information of core promoters. For the model, we use normalized histograms to approximate the underlying probability distributions of the IUPAC words extracted from the training sets. Bayesian decision theory then minimizes the probability of recognition error based on these probability distributions. We have tested our method on large genomic sequences of over $1.3$ million base-pairs and have obtained encouraging results. The Bayesian framework presented here is efficient and theoretically well-founded.
The framework improves on the recognition performance of traditional content-based methods. Analysis of several controlled databases and large genomic sequences demonstrates that, compared with other advanced promoter recognition systems such as PromoterInspector, our method achieves the best balance between sensitivity and specificity.
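
To make the pipeline described above concrete (IUPAC-word features, normalized histograms as class-conditional distributions, and a Bayes decision between two classes), the following is a minimal illustrative sketch, not the authors' implementation. Several things in it are assumptions made only for illustration: the word list, the toy training sequences, the add-one pseudocount, the equal class priors, and the treatment of word occurrences as independent draws (a multinomial naive Bayes reading of the abstract). All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Standard IUPAC nucleotide codes, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def word_counts(seq, words):
    """Count (possibly overlapping) occurrences of each IUPAC word in seq."""
    counts = {}
    for word in words:
        # A lookahead matches at every start position, so overlaps are counted.
        pattern = "(?=" + "".join(IUPAC[c] for c in word) + ")"
        counts[word] = len(re.findall(pattern, seq))
    return counts


def normalized_histogram(seqs, words, pseudocount=1.0):
    """Estimate P(word | class) as a normalized histogram over a training set.

    The pseudocount is an assumption added here to avoid zero probabilities.
    """
    totals = Counter()
    for seq in seqs:
        totals.update(word_counts(seq, words))
    denom = sum(totals[w] + pseudocount for w in words)
    return {w: (totals[w] + pseudocount) / denom for w in words}


def log_posterior_score(seq, hist, prior):
    """Unnormalized log posterior: log prior + sum of count * log P(word | class)."""
    counts = word_counts(seq, hist)
    return math.log(prior) + sum(n * math.log(hist[w]) for w, n in counts.items())


def classify(seq, hist_pos, hist_neg, prior_pos=0.5):
    """Bayes decision rule: choose the class with the larger posterior."""
    pos = log_posterior_score(seq, hist_pos, prior_pos)
    neg = log_posterior_score(seq, hist_neg, 1.0 - prior_pos)
    return "promoter" if pos > neg else "non-promoter"


# Hypothetical word list and toy training data, for illustration only.
WORDS = ["TATAWA", "CCAAT", "CG", "WWWW"]
toy_promoters = ["GCGCGTATAAAGGCGCCAATCG", "CGCGTATATAGCGCGCCAATGG"]
toy_background = ["ATTTTAAATTTATTTAAAATTA", "TTTAAATTATTTTAATAAATTT"]

hist_pos = normalized_histogram(toy_promoters, WORDS)
hist_neg = normalized_histogram(toy_background, WORDS)

print(classify("GGCGCGTATAAAGCGCG", hist_pos, hist_neg))
print(classify("ATTTATTTAAATTTTA", hist_pos, hist_neg))
```

Choosing the class with the larger posterior is what minimizes the probability of error under Bayesian decision theory when both kinds of error cost the same; working in log space only avoids numerical underflow on long sequences. With real data the word list, window size, and priors would have to come from the training sets rather than from the toy values used here.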