Paper Information - Addressing Provenance Issues in Big Data Genome Wide Association Studies (GWAS)

Effective genome-wide association studies (GWAS) present new Big Data challenges for health researchers: data processing delays, data provenance, and efficient real-time visualization. This paper presents two recent open source initiatives that, used together, aim to solve these issues. We first introduce GWAS and then describe the issues faced by the bioinformatics staff at a small health research lab. To address these issues, we introduce two open source projects we initiated, a query engine (QnGene) and a genetic output analysis tool (GOAT), and give an overview of their internal architecture and our current experimentation and validation plan.
