Unique Features of the Loblolly Pine (Pinus taeda L.) Megagenome Revealed Through Sequence Annotation

The largest genus in the conifer family Pinaceae is Pinus, with over 100 species. The size and complexity of pine genomes (∼20–40 Gb, 2n = 24) have delayed the arrival of a well-annotated reference sequence. In this study, we present the annotation of the first whole-genome shotgun assembly of loblolly pine (Pinus taeda L.), which comprises 20.1 Gb of sequence. The MAKER-P annotation pipeline combined evidence-based alignments and ab initio predictions to generate 50,172 gene models, of which 15,653 are classified as high confidence. Clustering these gene models with those of 13 other plant species resulted in 20,646 gene families, of which 1,554 are predicted to be unique to conifers. Among the conifer gene families, 159 are composed exclusively of loblolly pine members. The gene models for loblolly pine have the highest median and mean intron lengths of 24 fully sequenced plant genomes. Conifer genomes are rich in repetitive DNA, with the largest contributions coming from long-terminal-repeat retrotransposons. In-depth analysis of the tandem and interspersed repetitive content yielded a combined repeat estimate of 82%.
