Nephele: genotyping via complete composition vectors and MapReduce

Background
Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly because of insertions, deletions, and rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for the rapidly growing sequence datasets generated by next-generation DNA sequencing machines, because their computational cost grows quickly with the number of sequences.

Results
Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and uses affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware.
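To make the vector representation concrete, the following is a minimal Python sketch of an alignment-free comparison in the spirit of composition vectors. It is not Nephele's code: the function names, the choice of k range, the Markov-style expected frequency, and the cosine-based distance are assumptions made for illustration, and only k-mers actually observed in a sequence are stored.

```python
from collections import Counter
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def composition_vector(seq, k):
    """Composition values for one k (k >= 3).

    Each value is the relative deviation of the observed k-mer frequency
    from an expectation built from the (k-1)-mer and (k-2)-mer frequencies
    (an assumed Markov-style background model).
    """
    f_k = kmer_freqs(seq, k)
    f_k1 = kmer_freqs(seq, k - 1)
    f_k2 = kmer_freqs(seq, k - 2)
    vec = {}
    for kmer, observed in f_k.items():
        expected = f_k1[kmer[:-1]] * f_k1[kmer[1:]] / f_k2[kmer[1:-1]]
        vec[kmer] = (observed - expected) / expected
    return vec


def complete_composition_vector(seq, k_min=3, k_max=5):
    """Concatenate the composition vectors for every k in [k_min, k_max]."""
    vec = {}
    for k in range(k_min, k_max + 1):
        vec.update(composition_vector(seq, k))  # keys differ in length, no clashes
    return vec


def ccv_distance(u, v):
    """Distance in [0, 1] derived from the cosine of the angle between u and v."""
    dot = sum(x * v[key] for key, x in u.items() if key in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.5  # no usable signal: treat the vectors as uncorrelated
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


a = complete_composition_vector("AACGTACGGTTACGATCGA")
b = complete_composition_vector("TTGCATGCCATGACTTGCA")
print(ccv_distance(a, a), ccv_distance(a, b))
```

A pairwise matrix of such distances, negated to give similarities, is the kind of input a clustering step such as affinity propagation takes; no alignment is computed at any point.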
Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours.

Conclusions
We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome-scale sequence coverage.
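The MapReduce parallelization mentioned in the Results can be pictured with a small in-memory simulation of k-mer counting. This is only a sketch of the map, shuffle, and reduce pattern in plain Python; it does not use the Hadoop API, and the abstract does not say how Nephele actually divides its work across nodes. All function names here are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(record, k):
    """Map step: emit ((sequence id, k-mer), 1) for each overlapping k-mer."""
    seq_id, seq = record
    for i in range(len(seq) - k + 1):
        yield (seq_id, seq[i:i + k]), 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: total count for one (sequence id, k-mer) key."""
    return key, sum(values)


def count_kmers(records, k):
    """Run map, shuffle, and reduce in a single process."""
    mapped = chain.from_iterable(map_kmers(r, k) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, vals) for key, vals in grouped.items())


records = [("s1", "ACGAC"), ("s2", "ACG")]
counts = count_kmers(records, 2)
print(counts[("s1", "AC")])  # "AC" occurs twice in s1
```

Because each record is mapped independently and each key is reduced independently, both steps can be spread across compute nodes without coordination beyond the shuffle.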
