14. Genome Assembly and Annotation Process

The primary data produced by genome sequencing projects are often highly fragmented and sparsely annotated. This is especially true for the Human Genome Project [http://www.genome.gov/page.cfm?pageID=10001772] as a result of its policy of releasing sequence data to the public sequence databases every day (1, 2). So that individual researchers do not have to piece together extended segments of a genome and then relate the sequence to genetic maps and known genes, NCBI provides annotated assemblies of public genome sequence data. NCBI assimilates data of various types, from numerous sources, to provide an integrated view of a genome, making it easier for researchers to spot informative relationships that might not have been apparent from looking at the primary data. The annotated genomes can be explored using Map Viewer (Chapter 20) to display different types of data side-by-side and to follow links between related pieces of data.

This chapter describes the series of steps, the “pipeline”, that produces NCBI's annotated genome assembly from data deposited in the public sequence databases. A variant of the annotation process developed for the human genome is used to annotate the mouse genome, and similar procedures will be applied to other genomes (Box 1).

NCBI constantly strives to improve the accuracy of its human genome assembly and annotation, to make the data displays more informative, and to enhance the utility of our access tools. Each run through the assembly and annotation procedure, together with feedback from outside groups and individual users, is used to improve the process, refine the parameters for individual steps, and add new features. Consequently, the details of the assembly and annotation process change from one run to the next. This chapter, therefore, describes the overall human genome assembly and annotation process and provides short descriptions of the key steps, but it does not detail specific procedures or parameters. However, sufficient detail is provided to enable users of our assembly and annotations to become familiar with the complexities and possible limitations of the data we provide.

[1] Cécile Fizames, et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites, 1996, Nature.

[2] P. Deloukas, et al. A Gene Map of the Human Genome, 1996, Science.

[3] D R Bentley, et al. Genomic Sequence Information Should Be Released Immediately and Freely in the Public Domain, 1996, Science.

[4] Gregory D Schuler, et al. Sequence mapping by electronic PCR, 1997, Genome Research.

[5]  Gapped BLAST and PSI-BLAST: A new , 1997 .

[6] Rolf Apweiler, et al. The SWISS-PROT protein sequence data bank and its supplement TrEMBL, 1997, Nucleic Acids Res.

[7] P. Lijnzaad, et al. A physical map of 30,000 human genes, 1998, Science.

[8] M. Guyer. Statement on the rapid release of genomic DNA sequence, 1998, Genome Research.

[9]  J C Murray,et al.  Pediatrics and , 1998 .

[10] J. Jurka, et al. Repeats in genomic DNA: mining and meaning, 1998, Current Opinion in Structural Biology.

[11] A. Smit. Interspersed repeats and other mementos of transposable elements in mammalian genomes, 1999, Current Opinion in Genetics & Development.

[12] A. Bairoch, et al. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999, 1999, Nucleic Acids Res.

[13] G. Schuler, et al. Making effective use of human genomic sequence data, 1999, Trends in Genetics.

[14] G. Mahairas, et al. Sequence-tagged connectors: a sequence approach to mapping and scanning the human genome, 1999, Proceedings of the National Academy of Sciences of the United States of America.

[15] K. Sirotkin, et al. dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation, 1999, Genome Research.

[16] G. Mahairas, et al. Human BAC ends quality assessment and sequence analyses, 2000, Genomics.

[17] Lukas Wagner, et al. A Greedy Algorithm for Aligning DNA Sequences, 2000, J. Comput. Biol.

[18] Alex E. Lash, et al. A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome, 2000, Nature Genetics.

[19] R. Agarwala, et al. A fast and scalable radiation hybrid map construction and integration strategy, 2000, Genome Research.

[20] K. Katz, et al. Introducing RefSeq and LocusLink: curated human genome resources at the NCBI, 2000, Trends in Genetics.

[21] B. Trask, et al. A High-Resolution Radiation Hybrid Map of the Human Genome Draft Sequence, 2001, Science.

[22] D. Haussler, et al. Integration of cytogenetic landmarks into the draft sequence of the human genome, 2001, Nature.

[23] C. Burge, et al. Computational inference of homologous gene structures in the human genome, 2001, Genome Research.

[24] Donna R. Maglott, et al. RefSeq and LocusLink: NCBI gene-centered resources, 2001, Nucleic Acids Res.

[25] J. V. Moran, et al. Initial sequencing and analysis of the human genome, 2001, Nature.

[26] Benjamin A. Shoemaker, et al. CDD: a database of conserved domain alignments with links to domain three-dimensional structure, 2002, Nucleic Acids Res.

[27] D. Gudbjartsson, et al. A high-resolution recombination map of the human genome, 2002, Nature Genetics.