kakapo: Easy extraction and annotation of genes from raw RNA-seq reads

kakapo (kākāpō) is a Python-based pipeline for extracting and assembling one or more specified genes or gene families. It accepts raw RNA-seq reads or GenBank SRA accessions as input and does not require assembly of entire transcriptomes. The pipeline identifies open reading frames in the assembled gene transcripts and annotates them. It can optionally filter out ribosomal, plastid, and mitochondrial reads, as well as reads from non-target organisms (e.g., viral, bacterial, human). kakapo can be used to extract arbitrary loci, such as those commonly used for phylogenetic inference in systematics, or candidate genes and gene families in phylogenomic and metagenomic studies. We provide example applications and discuss how its use can offset the declining value of GenBank’s single-gene databases and help assemble datasets for a variety of phylogenetic analyses.
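To make the open-reading-frame step concrete, the following is a minimal sketch of a naive six-frame ORF scan on an assembled transcript. It is not kakapo's implementation: the function names (`reverse_complement`, `find_orfs`, `longest_orf`), the `min_len` threshold, the restriction to `ATG` start codons, and the standard stop codons are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT kakapo's actual ORF-finding code.
# Assumptions: ATG start codon, standard stop codons, unambiguous ACGT input.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, start, end, orf_sequence) tuples. Coordinates are
    0-based, end-exclusive, and relative to the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # open an ORF at the first ATG in this frame
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # include the stop codon
                    if end - start >= min_len:
                        orfs.append((strand, start, end, s[start:end]))
                    start = None  # look for the next ORF in this frame
    return orfs


def longest_orf(seq, min_len=30):
    """Return the longest ORF tuple, or None if no ORF passes min_len."""
    orfs = find_orfs(seq, min_len)
    return max(orfs, key=lambda o: len(o[3])) if orfs else None
```

For example, `find_orfs("CCATGAAATTTGGGTAACC", min_len=9)` reports a single forward-strand ORF spanning positions 2 to 17. A production pipeline would typically also handle ambiguous bases, alternative genetic codes, and partial ORFs that run off the ends of a transcript.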
