One-Shot Learning of Poisson Distributions in Serial Analysis of Gene Expression

Traditionally, studies in learning theory concentrate on situations where a potentially ever-increasing number of training examples is available. However, there are situations where inference must be performed on extremely small samples. In such situations it is of utmost importance to analyze theoretically what can be learned, and under what circumstances. One such scenario is the detection of differentially expressed genes. In our previous study (BMC Bioinformatics, 2009) we theoretically analyzed one of the most popular techniques for identifying genes with statistically different expression in SAGE libraries: the Audic-Claverie statistic (Genome Research, 1997). When comparing two libraries in the Audic-Claverie framework, it is assumed that under the null hypothesis their tag counts come from the same underlying (unknown) Poisson distribution. Since each SAGE library represents a single measurement, the inference has to be performed on the smallest sample possible, a sample of size 1. In this contribution we compare the Audic-Claverie approach with a (regularized) maximum likelihood (ML) framework. We analytically approximate the expected Kullback-Leibler (K-L) divergence from the true unknown Poisson distribution to the model and show that, while the expected K-L divergence to the ML-estimated models seems to be always larger than that for the Audic-Claverie statistic, the difference is most pronounced for true Poisson distributions with a small mean parameter. We also theoretically analyze the effect of regularizing the ML estimates in the case of zero observed counts. Our results constitute a rigorous analysis of a situation of great practical importance in which the benefits of a Bayesian approach can be clearly demonstrated in a quantitative and principled manner.
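The comparison described above can be explored numerically. The following is a minimal sketch, not the analytical approximation developed in the paper. It assumes equal library sizes, in which case the Audic-Claverie predictive probability of a count y given a single observed count x (flat prior on the Poisson mean) is C(x+y, x) / 2^(x+y+1). The ML model is taken to be a Poisson distribution with mean x, and a zero count is replaced by a hypothetical regularization constant `eps`; the value 0.5 is an illustrative assumption, and all function names are ours.

```python
import math

def poisson_logpmf(k, lam):
    """Log-probability of count k under a Poisson distribution with mean lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def ac_logpmf(y, x):
    """Log of the Audic-Claverie predictive p(y | x) = C(x+y, x) / 2^(x+y+1).

    Assumes both libraries have the same size.
    """
    log_binom = math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
    return log_binom - (x + y + 1) * math.log(2.0)

def kl_poisson_to_ac(lam, x, y_max):
    """K-L divergence from Poisson(lam) to p(. | x), with the sum truncated at y_max."""
    total = 0.0
    for y in range(y_max + 1):
        lp = poisson_logpmf(y, lam)
        total += math.exp(lp) * (lp - ac_logpmf(y, x))
    return total

def kl_poisson_to_poisson(lam, mu):
    """Closed-form K-L divergence from Poisson(lam) to Poisson(mu)."""
    return lam * math.log(lam / mu) - lam + mu

def expected_kl(lam, eps=0.5, n_max=None):
    """Expected K-L divergence over the single observed count x ~ Poisson(lam).

    Returns (Audic-Claverie value, regularized-ML value).
    eps is a hypothetical stand-in for the ML estimate when x == 0.
    """
    if n_max is None:
        # Truncation point chosen far into the Poisson tail.
        n_max = int(lam + 12 * math.sqrt(lam) + 40)
    e_ac = 0.0
    e_ml = 0.0
    for x in range(n_max + 1):
        w = math.exp(poisson_logpmf(x, lam))
        e_ac += w * kl_poisson_to_ac(lam, x, n_max)
        mu = x if x > 0 else eps
        e_ml += w * kl_poisson_to_poisson(lam, mu)
    return e_ac, e_ml

if __name__ == "__main__":
    for lam in (0.5, 1.0, 5.0, 20.0):
        e_ac, e_ml = expected_kl(lam)
        print(f"lambda={lam:5.1f}  E[KL] AC={e_ac:.4f}  ML={e_ml:.4f}")
```

For small means, the ML figure depends strongly on the choice of `eps`, since the divergence to the model fitted from a zero count grows without bound as `eps` approaches zero.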

[1]  C. V. Van Tassell, et al. Comparative transcriptome analysis of in vivo‐ and in vitro‐produced porcine blastocysts by small amplified RNA‐Serial analysis of gene expression (SAR‐SAGE), 2008, Molecular Reproduction and Development.

[2]  Hyun-Jin Kim, et al. Pepper EST database: comprehensive in silico tool for analyzing the chili pepper (Capsicum annuum) transcriptome, 2008, BMC Plant Biology.

[3]  Nanxiang Ge, et al. An Empirical Bayesian Significance Test of cDNA Library Data, 2004, J. Comput. Biol.

[4]  D. Stekel, et al. The comparison of gene expression from multiple cDNA libraries, 2000, Genome Research.

[5]  L. Varuzza, et al. Significance tests for comparing digital gene expression profiles, 2008.

[6]  Peter Tiño, et al. Basic properties and information theory of Audic-Claverie statistic for analyzing cDNA arrays, 2009, BMC Bioinformatics.

[7]  J. Claverie, et al. The significance of digital gene expression profiles, 1997, Genome Research.

[8]  C. Molina, et al. SuperSAGE: the drought stress-responsive transcriptome of chickpea roots, 2008, BMC Genomics.

[9]  K. Kinzler, et al. Serial Analysis of Gene Expression, 1995, Science.

[10]  Ryan D. Morin, et al. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells, 2008, Genome Research.

[11]  J. Ruijter, et al. Statistical evaluation of SAGE libraries: consequences for experimental design, 2002, Physiological Genomics.

[12]  Ramón Díaz-Uriarte, et al. Detection of recurrent copy number alterations in the genome: taking among-subject heterogeneity seriously, 2009, BMC Bioinformatics.

[13]  G. Cervigni, et al. Gene expression in diplosporous and sexual Eragrostis curvula genotypes with differing ploidy levels, 2008, Plant Molecular Biology.