Application of Array-Oriented Scientific Data Formats (NetCDF) to Genotype Data, GWASpi as an Example

Over the last three decades, the power, resolution and sophistication of scientific experiments have increased greatly, generating vast volumes of biological data that need to be stored and processed. Array-oriented Scientific Data Formats are part of an effort by diverse scientific communities to address the growing problems of data storage and manipulation. Genome-wide Association Studies (GWAS) based on Single Nucleotide Polymorphism (SNP) arrays are one of the technologies that produce large volumes of data, particularly information on genomic variability. Because the available methods and software packages are complex, each with its own intricate formats and workflows, the analysis of GWAS data confronts scientists with difficult hardware and software problems. To help ease these issues, we have introduced the use of Array-oriented Scientific Data Format databases (NetCDF) in the GWASpi application, a user-friendly, multi-platform desktop software package for the management and analysis of GWAS data. The resulting leap in performance makes it possible to get the most out of commonly available desktop hardware, on which GWASpi now enables "start-to-end" GWAS management, from raw data to end results and charts. Not only does NetCDF allow the data to be stored efficiently, it also reduces the time needed to obtain the basic results of a GWAS by up to two orders of magnitude. Additionally, the same principles can be used to store and analyze variability data generated by means of ultrasequencing technologies. GWASpi is available at http://www.gwaspi.org.