MOTIVATION
SAGE (Serial Analysis of Gene Expression) can be used to estimate the number of unique transcripts in a transcriptome. A simple estimator that corrects for sequencing and sampling errors was applied to a SAGE library (137 832 tags) obtained from mouse embryonic stem cells, and also to Monte Carlo simulated libraries generated using assumed distributions of 'true' expression levels consistent with the data.
RESULTS
When the corrected data themselves were taken as the underlying model of 'ground truth', the estimator converged to the 'true' value (53 535) only after counting 300 000 simulated tags, more than twice the number in the experiment. The SAGE data could also be fit well by a Monte Carlo model based on a truncated inverse-square distribution of expression levels, with 130 000 'true' transcripts and 10^6 samples needed for convergence. We conclude that the size of a transcriptome is ill-determined from SAGE libraries of even moderately large size. To obtain a valid estimate, one must sample a number of tags inversely proportional to the lowest abundance level, which is not known a priori. This constrains the design of SAGE experiments intended to determine biological complexity.
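The slow convergence described above can be illustrated with a minimal Monte Carlo sketch. This is not the authors' Fortran code and does not reproduce their error-correcting estimator: it only counts the distinct transcripts observed as more tags are drawn. The function names, the seed, and the toy sizes (5000 transcripts, levels truncated at 1000) are assumptions chosen for illustration, not values from the study.

```python
import random
from itertools import accumulate


def draw_expression_levels(n_transcripts, max_level, rng):
    """Assign each transcript a level k with P(k) proportional to 1/k^2, k = 1..max_level."""
    levels = range(1, max_level + 1)
    weights = [1.0 / (k * k) for k in levels]
    return rng.choices(levels, weights=weights, k=n_transcripts)


def observed_unique(levels, checkpoints, rng):
    """Sample tags in proportion to expression level and record how many
    distinct transcripts have been seen at each checkpoint (a tag count)."""
    wanted = set(checkpoints)
    cum = list(accumulate(levels))  # cumulative weights for proportional sampling
    draws = rng.choices(range(len(levels)), cum_weights=cum, k=max(wanted))
    seen = set()
    result = {}
    for i, transcript in enumerate(draws, start=1):
        seen.add(transcript)
        if i in wanted:
            result[i] = len(seen)
    return result


# Toy run (hypothetical sizes, far smaller than a real SAGE library).
rng = random.Random(0)
levels = draw_expression_levels(5000, 1000, rng)
curve = observed_unique(levels, [1000, 10000, 100000], rng)
for n_tags in sorted(curve):
    print(n_tags, "tags:", curve[n_tags], "of", len(levels), "transcripts observed")
```

A transcript with relative abundance p is missed by N independent tags with probability (1 - p)^N, roughly exp(-Np), so the observed count approaches the true total only once N is several times 1/p for the rarest transcripts. This is the inverse proportionality to the lowest abundance level noted above.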
AVAILABILITY
The 'homemade' software used for this analysis was not designed for general or 'production' use, but the authors will be happy to share the Fortran source code with interested parties.
CONTACT
sternm@grc.nia.nih.gov