Aligner optimization increases accuracy and decreases compute times in multi-species sequence data

As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows–Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, the sequence data had one majority member (i.e. Plasmodium falciparum or Brugia malayi) and one minority member (i.e. human or the Wolbachia endosymbiont wBm). Increasing the bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium, at a seed length of 18 nt, 24.1 % of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6 % of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7 % of reads mapping in 1.9±0.1 CPU hours). In contrast, 97.1 % of the reads mapped to a combined Plasmodium–human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
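The recommended setup (one combined reference, 23 nt seed length) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names and file names are hypothetical, and the commands are only built and printed, not run. It relies on the `bwa index` subcommand and the `bwa mem` options `-k` (minimum seed length) and `-t` (threads).

```python
import shlex
from pathlib import Path


def combine_references(fasta_paths, combined_path):
    """Concatenate several FASTA files into one combined reference file."""
    with open(combined_path, "w") as out:
        for path in fasta_paths:
            text = Path(path).read_text()
            # Make sure each genome ends with a newline so that the next
            # FASTA header starts on its own line.
            out.write(text if text.endswith("\n") else text + "\n")
    return combined_path


def bwa_index_command(reference):
    """Build the command that indexes the combined reference."""
    return ["bwa", "index", str(reference)]


def bwa_mem_command(reference, reads_1, reads_2, seed_length=23, threads=1):
    """Build a paired-end bwa-mem command with a non-default seed length."""
    return [
        "bwa", "mem",
        "-k", str(seed_length),  # minimum seed length
        "-t", str(threads),      # number of threads
        str(reference), str(reads_1), str(reads_2),
    ]


if __name__ == "__main__":
    # Hypothetical file names, used for illustration only.
    print(shlex.join(bwa_index_command("combined.fa")))
    print(shlex.join(bwa_mem_command("combined.fa", "reads_1.fq", "reads_2.fq")))
```

Sequence names must be unique across the combined genomes, since each alignment is attributed to a species by the name of the reference sequence it hits.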
