CGDM: collaborative genomic data model for molecular profiling data using NoSQL

MOTIVATION: High-throughput molecular profiling has greatly improved patient stratification and the mechanistic understanding of diseases. As the amount of data used in translational medicine studies has grown in recent years, data warehouses need better performance for data retrieval and statistical processing. Both relational and key-value models have been used to manage molecular profiling data. Key-value models such as SeqWare have been shown to be particularly advantageous in query processing speed for large datasets. However, further improvement can be achieved, particularly through better indexing techniques for key-value models that take advantage of the query types specific to high-throughput molecular profiling data.

RESULTS: In this article, we introduce the Collaborative Genomic Data Model (CGDM), which aims to significantly increase query processing speed for the main classes of queries on genomic databases. CGDM creates three Collaborative Global Clustering Index Tables (CGCITs) to solve the velocity and variety issues at the cost of limited extra volume. Several benchmarking experiments were carried out, comparing CGDM implemented on HBase with the traditional SQL data model (TDM) implemented on both HBase and MySQL Cluster, using large publicly available molecular profiling datasets taken from NCBI and HapMap. In the microarray case, CGDM on HBase performed up to 246 times faster than TDM on HBase and 7 times faster than TDM on MySQL Cluster. In the single nucleotide polymorphism case, CGDM on HBase outperformed TDM on HBase by up to 351 times and TDM on MySQL Cluster by up to 9 times.

AVAILABILITY AND IMPLEMENTATION: The CGDM source code is available at https://github.com/evanswang/CGDM.

CONTACT: y.guo@imperial.ac.uk
