Nonparametric Estimation of the Number of Unique Sequences in Biological Samples

Large-scale determination of uniquely expressed genes (or mRNAs) in specific cells and tissues is a challenging problem in computational and functional genomics. We consider nonparametric approaches for estimating the number of unique, nonredundant sequences in biological samples. By introducing the moments of species abundance in a population, we analyze the relative abundance of species in the population and present a lower-bound estimator and a so-called medial estimator for the number of distinct species in the population. The lower-bound estimator is applicable to populations with a small coefficient of variation (CV). The medial estimator works well for populations with a relatively large CV, in particular gene expression data. Simulation analysis shows that the medial estimator performs better than existing methods. Finally, we apply our nonparametric approaches to estimate the number of expressed mRNAs in a normal colon epithelial tissue, as well as the number of unique clones in an amplified cDNA sample prepared from the CNS of the sea slug Aplysia.
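The abstract does not give the formulas for the lower-bound or medial estimators, so they are not reproduced here. For orientation only, the sketch below shows two classical nonparametric quantities from the species-richness literature: Chao's lower bound on the number of classes and Good's sample coverage. Both are swapped-in, well-known techniques and are not claimed to be the estimators proposed in this paper; the function names and the toy counts are assumptions made for illustration.

```python
from collections import Counter


def chao1_lower_bound(counts):
    """Classical Chao lower bound on the number of distinct species.

    counts: iterable of observed counts, one entry per observed species
    (for example, tag counts per distinct sequence). Not the paper's estimator.
    """
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)            # number of species actually observed
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]      # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when no doubletons are observed
    return s_obs + f1 * (f1 - 1) / 2


def good_coverage(counts):
    """Good's estimate of sample coverage: 1 - (singletons / sample size)."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    f1 = sum(1 for c in counts if c == 1)
    return 1 - f1 / n


# Hypothetical toy sample: 8 observed species, 16 sequences in total
toy_counts = [1, 1, 1, 1, 2, 2, 3, 5]
print(chao1_lower_bound(toy_counts))  # 12.0
print(good_coverage(toy_counts))      # 0.75
```

With four singletons and two doubletons among eight observed species, the bound adds 4² / (2 · 2) = 4 unseen species to the observed 8. A bound of this kind tends to be tight only when abundances are fairly homogeneous, which is consistent with the abstract's remark that lower-bound estimation suits populations with a small CV.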
