TARGeT: a web-based pipeline for retrieving and characterizing gene and transposable element families from genomic sequences

Gene families make up a large proportion of eukaryotic genomes. Rapidly expanding genomic sequence databases provide a good opportunity to study gene family evolution and function. However, most gene family identification programs are restricted to searching protein databases, whose content often lags behind the genomic sequence data. Here, we report a user-friendly web-based pipeline, named TARGeT (Tree Analysis of Related Genes and Transposons), which uses either a DNA or amino acid ‘seed’ query to: (i) automatically identify and retrieve gene family homologs from a genomic database, (ii) characterize gene structure and (iii) perform phylogenetic analysis. Because of its high speed, TARGeT can also characterize very large gene families, including transposable elements (TEs). We evaluated TARGeT using well-annotated datasets, including the ascorbate peroxidase gene family of rice, maize and sorghum and several TE families in rice. In all cases, TARGeT rapidly recapitulated the known homologs and predicted new ones. We also demonstrated that TARGeT outperforms similar pipelines and offers functionality that is not available elsewhere.
