Fractal MapReduce decomposition of sequence alignment

Background
The dramatic fall in the cost of genomic sequencing and the increasing convenience of distributed cloud computing resources position the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm distributes naturally: map functions process vectorized components, and a reduce step aggregates the intermediate results. However, for some data analysis procedures, such as sequence analysis, a more fundamental reformulation may be required.

Results
In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique that has been found to provide an "alignment-free" solution to sequence analysis and comparison, that is, a solution that does not require dynamic programming and relies instead on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated by calculating the length of the longest similar segment by inspecting only the Universal Sequence Map (USM) coordinates of two analogous units, with no resort to dynamic programming.

Conclusions
The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment, in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered as a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.

Availability
The accompanying software library is publicly distributed as open source under version control at http://usm.github.com. It is also available as a webApp through Google Chrome's WebStore (http://chrome.google.com/webstore; search for "usm").
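The idea summarized above can be illustrated with a minimal Python sketch. It is not the API of the accompanying JavaScript library; the function names, the corner assignment for the four nucleotides, and the starting point (0.5, 0.5) are all assumptions made for illustration. Each symbol moves the current point halfway toward its corner of the unit square, so two positions preceded by the same k symbols end up less than 2^-k apart in every dimension. Taking floor(-log2) of the largest coordinate difference therefore gives an upper bound on the length of the shared segment, which can overestimate the true length by a few symbols. Scoring each pair of positions is an independent map step, and keeping the maximum is a reduce step.

```python
from functools import reduce
from itertools import product
from math import floor, log2

# Assumed corner assignment on the unit square; any fixed assignment works.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(seq, start=(0.5, 0.5)):
    """Iterated map: each point lies halfway between the previous point
    and the corner assigned to the current symbol."""
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def shared_suffix_estimate(p, q, cap):
    """Upper-bound estimate of how many trailing symbols two positions
    share, computed from their coordinates alone."""
    delta = max(abs(p[0] - q[0]), abs(p[1] - q[1]))
    if delta == 0.0:
        return cap  # identical coordinates: the histories match up to the cap
    return min(cap, floor(-log2(delta)))


def longest_similar_segment(a, b):
    """Map a score over every pair of positions, then reduce with max."""
    pa, pb = cgr_coordinates(a), cgr_coordinates(b)
    # Map step: each pair (i, j) is scored independently of all others.
    # The cap is the number of symbols available before both positions.
    scores = (
        shared_suffix_estimate(pa[i], pb[j], min(i, j) + 1)
        for i, j in product(range(len(a)), range(len(b)))
    )
    # Reduce step: keep the largest estimate.
    return reduce(max, scores, 0)


if __name__ == "__main__":
    print(longest_similar_segment("GATTACA", "CATTAGG"))
```

Because every pair is scored without reference to any other pair, the map step could in principle be split across workers, with each worker returning a local maximum for the final reduce. The floating-point coordinates in this sketch are exact only for short sequences (roughly 50 symbols); longer inputs would need an integer or bitwise encoding.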
