RNA-seq gene and transcript expression analysis using the BioExtract Server and iPlant Collaborative

Background: The development of Next Generation Sequencing (NGS) technology provides great opportunities to study gene expression, spliced transcripts, post-transcriptional changes, gene fusions, and mutations/SNPs. The large amount of data generated by these approaches presents many challenges. For example, how can we manage and analyze these vast datasets in order to extract new knowledge?

Aims: This paper provides an integrated, adaptable, and scalable scenario to guide researchers through a complex data analysis process using the iPlant Collaborative AGAVE RESTful API through the BioExtract Server. In three modules, we show how a High Performance Cluster (HPC) can be leveraged in a Workflow Management System (WMS) by following simple analytic steps.

Results: A workflow has been developed in the BioExtract Server to analyze RNA-Seq data. Running this workflow on a 21.6 GB dataset provides reliable gene and transcript expression results. Compared to an existing manual workflow on the same dataset, the BioExtract Server workflow shows an ≈8-fold speedup in execution time (from ≈18 h to ≈2 h 10 min). Additionally, there are several qualitative improvements, such as automation, reproducibility, shareability, and scalability. (Note: the performance was not compared to the workflow installed at Galaxy, https://usegalaxy.org/, due to extensive wait times on their public site.) At a false discovery rate (FDR) of 0.05, the workflow execution on the input datasets identifies 342 genes, 228 isoforms, 270 transcription start sites (TSS), 47 coding sequences (CDS), and 23 promoters as significantly differentially expressed.

Conclusion: The ability to easily create and execute workflows that leverage the robust iPlant cyberinfrastructure to analyze NGS data represents one more step in the improvement of eScience initiatives. It considerably improves the ability of life science researchers to apply NGS tools. However, enhancements to this approach remain important as improvements in HPC and WMS technology, techniques, and software continue every day. Our next challenge will be to follow that evolution in order to minimize the gap between researchers and these powerful resources.

Availability: The tools used here are freely available at the referenced links. Additional data analysis from our workflow execution is available on request. Our workflow is available on MyExperiment under a Creative Commons (CC) license (http://www.myexperiment.org/workflows/3895.html?version=1).
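
As a quick check of the execution-time comparison reported in the Results, the following minimal sketch converts the two approximate run times (≈18 h and ≈2 h 10 min) into a speedup factor and a percentage of time saved. The variable and function names are illustrative only and are not part of the BioExtract Server or the workflow itself.

```python
def to_minutes(hours: int, minutes: int = 0) -> int:
    """Convert a duration given as hours and minutes into total minutes."""
    return hours * 60 + minutes


# Approximate run times taken from the Results paragraph.
manual_min = to_minutes(18)          # existing manual workflow, about 18 h
bioextract_min = to_minutes(2, 10)   # BioExtract Server workflow, about 2 h 10 min

# Speedup factor: how many times faster the automated workflow ran.
speedup = manual_min / bioextract_min

# Share of the original wall-clock time that was saved, in percent.
time_saved_pct = 100 * (1 - bioextract_min / manual_min)

print(f"speedup: {speedup:.1f}x")
print(f"time saved: {time_saved_pct:.0f}%")
```

Under these rounded inputs the run time drops by a factor of about eight, which corresponds to saving close to nine tenths of the original execution time.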
