Addressing Provenance Issues in Big Data Genome Wide Association Studies (GWAS)

Effective genome-wide association studies (GWAS) present new Big Data challenges for health researchers: data processing delays, data provenance, and efficient real-time visualization. This paper presents two recent open source initiatives that, used together, aim to solve these issues. We first introduce GWAS and describe the issues faced by the bioinformatics staff at a small health research lab. We then introduce the two open source projects we initiated, a query engine (QnGene) and a genetic output analysis tool (GOAT), and give an overview of their internal architecture and of our current experimentation and validation plan.
