Cloud-MAQ: The cloud-enabled scalable whole genome reference assembly application

Many problems in biology are NP-hard and demand substantial time and computing resources, and most computing systems developed for quantifying biological objects are limited by this. MAQ (Mapping and Assembly with Quality) is a popular bioinformatics system for whole genome reference assembly. It is designed to handle the challenges of short sequence reads generated by Illumina sequencing machines and supports a maximum read length of 63 nucleotides. MAQ is neither multithreaded nor many-core ready: it runs on a single CPU, so as the data size increases it does not scale and requires a supercomputer to complete the assembly within a desired time. In this paper we report Cloud-MAQ, which uses the cloud computing paradigm to address these challenges of whole genome reference assembly. Using Hadoop and the cloud paradigm, MAQ is made parallel and scalable. In addition, MAQ's functionality has been extended to support the more recent 76-nucleotide reads from Illumina. Cloud-MAQ improves the performance of MAQ reference assembly multi-fold.
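
The general idea of parallelizing a single-threaded mapper can be illustrated with a scatter/gather sketch in Python. This is not the Cloud-MAQ implementation: the names `chunk_reads`, `map_chunk`, and `scatter_gather` are hypothetical, the exact-substring search merely stands in for running an aligner such as MAQ on one chunk of reads, and a local thread pool stands in for the Hadoop workers. It only shows that reads can be split into independent chunks, mapped separately against the same reference, and merged afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str]  # (read name, nucleotide sequence)
Hit = Tuple[str, int]   # (read name, 0-based position on the reference)


def chunk_reads(reads: Iterable[Read], chunk_size: int) -> Iterator[List[Read]]:
    """Scatter step: split the reads into independent chunks."""
    it = iter(reads)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk: List[Read], reference: str) -> List[Hit]:
    """Placeholder for a single-threaded aligner run on one chunk.

    Assumption: exact substring search replaces real alignment here.
    """
    hits = []
    for name, seq in chunk:
        pos = reference.find(seq)
        if pos >= 0:  # unmapped reads are dropped
            hits.append((name, pos))
    return hits


def scatter_gather(reads: Iterable[Read], reference: str,
                   chunk_size: int = 2, workers: int = 4) -> List[Hit]:
    """Map every chunk independently, then merge hits by position."""
    chunks = list(chunk_reads(reads, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda c: map_chunk(c, reference), chunks)
        merged = [hit for hits in partial for hit in hits]
    # Gather step: order the combined hits along the reference.
    return sorted(merged, key=lambda h: h[1])


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    reads = [("r1", "ACGT"), ("r2", "TTAG"), ("r3", "GGGG"),
             ("r4", "TAGC"), ("r5", "CGTT")]
    print(scatter_gather(reads, reference))
```

Because each chunk is mapped without reference to any other chunk, the number of workers can grow with the number of reads, which is the property a Hadoop-based deployment would rely on.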