Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries

Background: In expressed sequence tag (EST) sequencing, we are often interested in how many genes we can capture in an EST sample of a targeted size. This information gives insight into sequencing efficiency in experimental design, as well as clues to the diversity of expressed genes in the tissue from which the library was constructed.

Results: We propose a compound Poisson process model that can accurately predict the gene capture in a future EST sample based on an initial EST sample. It also allows estimation of the number of genes expressed in one cDNA library or co-expressed in two cDNA libraries. A simulation study establishes the superior performance of the new prediction method over an existing approach. Our analysis of four Arabidopsis thaliana EST sets suggests that the number of expressed genes present in the corresponding cDNA libraries varies from 9155 (root) to 12005 (silique). An observed fraction of co-expressed genes in two different EST sets as low as 25% can correspond to an actual overlap fraction greater than 65%.

Conclusion: The proposed method provides a convenient tool for gene capture prediction and cDNA library property diagnosis in EST sequencing.
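To make the prediction task concrete, the sketch below estimates how many previously unseen genes a further EST sample would capture, given the cluster sizes from an initial sample. It is not the compound Poisson estimator proposed in this paper, whose details the abstract does not give. It uses the classical Good-Toulmin estimator, a nonparametric predictor that arises when each gene's EST count is assumed to be Poisson with its own rate. The function names and the toy cluster sizes are hypothetical, and the estimator is only stable when the future sample is no larger than the initial one (t <= 1).

```python
from collections import Counter


def frequency_counts(ests_per_gene):
    """Return a map k -> number of genes observed exactly k times.

    `ests_per_gene` is a list of EST counts, one per gene cluster
    (an assumed input format, e.g. cluster sizes from an assembly).
    """
    return Counter(k for k in ests_per_gene if k > 0)


def good_toulmin_new_genes(ests_per_gene, t):
    """Estimate the number of new genes in a future sample.

    `t` is the size of the future sample relative to the initial
    sample (t = 0.5 means half as many additional ESTs).
    Good-Toulmin: sum over k >= 1 of (-1)^(k+1) * t^k * f_k,
    where f_k is the number of genes seen exactly k times.
    """
    if not 0 <= t <= 1:
        raise ValueError("this simple estimator is only stable for 0 <= t <= 1")
    f = frequency_counts(ests_per_gene)
    return sum((-1) ** (k + 1) * t ** k * n_k for k, n_k in f.items())


# Toy data (assumed): four gene clusters with 3, 1, 1 and 2 ESTs.
cluster_sizes = [3, 1, 1, 2]
predicted_same_size = good_toulmin_new_genes(cluster_sizes, 1.0)
predicted_half_size = good_toulmin_new_genes(cluster_sizes, 0.5)
```

Adding the predicted number of new genes to the number already observed gives a predicted gene capture for the combined sample. Singletons (genes seen exactly once) dominate the estimate, which is why a library with many singletons is expected to keep yielding new genes.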
