Biological Pathway Analysis for De Novo Transcriptomes through Multiple Reference Species Selections

In de novo transcriptome analysis, gene mapping and genome annotation are commonly performed against the single reference model species that is closest in evolutionary distance. However, not every reference model species has comprehensive genome annotations and curated information, and the number of genes that can be mapped to the selected species cannot be known in advance. When too few genes are mapped, the subsequent functional pathway analysis of the transcriptome dataset is seriously affected. To address this problem, we propose an improved approach based on selecting multiple reference model species, aimed in particular at KEGG pathway analysis of differentially expressed genes. By taking the union of the genes mapped individually to each selected species, the approach significantly improves the completeness of gene mapping in KEGG pathways and provides more realistic P-values for each identified pathway. Furthermore, based on the mapped genes and KGML datasets, we apply various gray levels, colors, and shapes to present gene expression conditions on each biological pathway. Using NGS transcriptomic datasets from an unknown Antarctic green alga species as an experimental example, and selecting three published species (Chlamydomonas reinhardtii, Chlorella variabilis, and Coccomyxa subellipsoidea) as candidate reference species, we compared the pathway enrichment results obtained with different selections of reference species. Integrating all mapped genes from the different model species gave better results than using any single reference species: under an identical P-value threshold, important biological pathways that were otherwise missed could be retrieved, such as the Ribosome, Pyrimidine metabolism, and ABC transporters pathways. We therefore believe that appropriate selection of multiple reference species is necessary and significant for transcriptome analysis of de novo species.
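
The union step and its effect on pathway P-values can be illustrated with a minimal sketch. It assumes that genes mapped to each reference species have already been reduced to a shared identifier space (KEGG Orthology IDs are used here as a hypothetical choice) and that enrichment is scored with a one-sided hypergeometric test; the abstract does not state which test was used, and the function names and toy identifiers below are illustrative only.

```python
from math import comb


def union_mapped(per_species):
    """Union of the gene identifiers mapped to each reference species."""
    merged = set()
    for genes in per_species.values():
        merged |= set(genes)
    return merged


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N items from M, of which n are 'successes'."""
    upper = min(n, N)
    total = comb(M, N)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible terms drop out of the sum automatically.
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def enrich(pathways, background, deg):
    """Hypergeometric P-value per pathway (assumed test, not the paper's stated one)."""
    deg_in_bg = deg & background
    results = {}
    for name, members in pathways.items():
        in_bg = members & background
        if not in_bg:
            continue  # pathway has no mapped genes at all
        hits = in_bg & deg_in_bg
        results[name] = hypergeom_sf(
            len(hits), len(background), len(in_bg), len(deg_in_bg)
        )
    return results


# Toy data: hypothetical identifiers, not taken from the study.
per_species = {
    "C. reinhardtii": {"K1", "K2", "K5"},
    "C. variabilis": {"K2", "K3"},
    "C. subellipsoidea": {"K4", "K6"},
}
pathways = {"Ribosome": {"K3", "K4"}, "ABC transporters": {"K1", "K9"}}
deg = {"K1", "K3", "K4"}

background = union_mapped(per_species)
pvalues = enrich(pathways, background, deg)
```

With a single reference species as the background, a pathway such as the toy "Ribosome" entry may have no mapped members and drop out entirely; with the union as the background it is scored, which is the effect the abstract describes.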
