Damming the genomic data flood using a comprehensive analysis and storage data structure

Data generation, driven by rapid advances in genomic technologies, is fast outpacing our analysis capabilities. Faced with this flood of data, more hardware and software resources are added to accommodate data sets whose structure was not specifically designed for analysis. This leads to unnecessarily lengthy processing times and excessive data handling and storage costs. Current efforts to address this have centered on developing new indexing schemas and analysis algorithms, whereas the root of the problem lies in the format of the data itself. We have developed a new data structure for storing and analyzing genotype and phenotype data. By combining data normalization techniques, database management system capabilities and a novel multi-table, multidimensional database structure, we have eliminated the following: (i) unnecessarily large data set size due to high levels of redundancy, (ii) sequential access to these data sets and (iii) common bottlenecks in analysis times. The data structure horizontally divides the data to circumvent traditional problems associated with the use of databases for very large genomic data sets. Without any loss of information, the resulting data set required 86% less disk space and performed analytical calculations 6248 times faster than a standard approach.

Database URL: http://castor.pharmacogenomics.ca
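To make the two central ideas concrete, the following is a minimal sketch of normalization (each genotype label is stored once and referenced by an integer code) and horizontal division (genotype rows are spread across several smaller tables by marker). It is not the schema described in this work: the table names `genotype_code`, `marker` and `geno_part_N`, the functions `load_calls` and `genotype_counts`, the `markers_per_part` parameter and the use of SQLite are all assumptions chosen for illustration.

```python
import sqlite3


def load_calls(conn, calls, markers_per_part=2):
    """Load (sample_id, marker_name, genotype_label) tuples.

    Hypothetical layout: labels are normalized into genotype_code, and
    genotype rows are divided horizontally into geno_part_<n> tables,
    with markers_per_part markers assigned to each table.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS genotype_code "
                "(id INTEGER PRIMARY KEY, label TEXT UNIQUE)")
    cur.execute("CREATE TABLE IF NOT EXISTS marker "
                "(id INTEGER PRIMARY KEY, name TEXT UNIQUE, part_no INTEGER)")
    for sample_id, marker_name, label in calls:
        # Normalization: store each distinct label once, reference it by id.
        cur.execute("INSERT OR IGNORE INTO genotype_code (label) VALUES (?)",
                    (label,))
        code_id = cur.execute("SELECT id FROM genotype_code WHERE label = ?",
                              (label,)).fetchone()[0]
        row = cur.execute("SELECT id, part_no FROM marker WHERE name = ?",
                          (marker_name,)).fetchone()
        if row is None:
            # Assign a new marker to a partition based on how many exist.
            n_markers = cur.execute("SELECT COUNT(*) FROM marker").fetchone()[0]
            part_no = n_markers // markers_per_part
            cur.execute("INSERT INTO marker (name, part_no) VALUES (?, ?)",
                        (marker_name, part_no))
            marker_id = cur.lastrowid
            cur.execute(f"CREATE TABLE IF NOT EXISTS geno_part_{part_no} "
                        "(sample_id TEXT, marker_id INTEGER, code_id INTEGER)")
        else:
            marker_id, part_no = row
        cur.execute(f"INSERT INTO geno_part_{part_no} VALUES (?, ?, ?)",
                    (sample_id, marker_id, code_id))
    conn.commit()


def genotype_counts(conn, marker_name):
    """Count genotype labels for one marker, reading only its partition."""
    cur = conn.cursor()
    row = cur.execute("SELECT id, part_no FROM marker WHERE name = ?",
                      (marker_name,)).fetchone()
    if row is None:
        return {}
    marker_id, part_no = row
    rows = cur.execute(
        f"SELECT c.label, COUNT(*) FROM geno_part_{part_no} AS g "
        "JOIN genotype_code AS c ON c.id = g.code_id "
        "WHERE g.marker_id = ? GROUP BY c.label",
        (marker_id,)).fetchall()
    return dict(rows)
```

With an in-memory connection (`sqlite3.connect(":memory:")`), calling `load_calls` on a handful of tuples and then `genotype_counts(conn, "rs1")` touches only the one partition table that holds that marker, which is the property horizontal division is meant to provide in this sketch.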
