Cooler: scalable storage for Hi-C data and other genomically labeled arrays

MOTIVATION Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need for data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate the development of scalable algorithms for data analysis.

RESULTS We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata. Cooler is based on HDF5 and is supported by a Python library and command-line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium.

AVAILABILITY Cooler is cross-platform, BSD-licensed, and can be installed from the Python Package Index or the Bioconda repository. The source code is maintained on GitHub at https://github.com/mirnylab/cooler.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
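To make the idea of a sparse data model for a genomically labeled matrix concrete, the following is a minimal pure-Python sketch. It is an illustration only and does not use the cooler library or reproduce its file layout: the names `Bin`, `make_bins`, `to_sparse` and `query`, the toy chromosome sizes, and the choice to keep only the upper triangle of a symmetric matrix are all assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical names for illustration; not the cooler schema or API.
Bin = namedtuple("Bin", ["chrom", "start", "end"])


def make_bins(chromsizes, binsize):
    """Split each chromosome into consecutive fixed-width bins."""
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            bins.append(Bin(chrom, start, min(start + binsize, length)))
    return bins


def to_sparse(dense):
    """Keep nonzero upper-triangle cells as (bin1_id, bin2_id, count).

    Assumes the matrix is symmetric, so the lower triangle is redundant.
    """
    n = len(dense)
    return [
        (i, j, dense[i][j])
        for i in range(n)
        for j in range(i, n)
        if dense[i][j] != 0
    ]


def query(pixels, bins, chrom):
    """Return the sparse entries whose two bins both lie on `chrom`."""
    ids = {k for k, b in enumerate(bins) if b.chrom == chrom}
    return [(i, j, v) for i, j, v in pixels if i in ids and j in ids]


# Toy genome: two chromosomes binned at a resolution of 100 bp.
chromsizes = {"chr1": 250, "chr2": 100}
bins = make_bins(chromsizes, 100)

# Toy symmetric contact matrix over the four bins.
dense = [
    [5, 2, 0, 0],
    [2, 7, 1, 0],
    [0, 1, 3, 0],
    [0, 0, 0, 4],
]
pixels = to_sparse(dense)

print(len(pixels), "stored values instead of", len(dense) ** 2)
print(query(pixels, bins, "chr2"))
```

In this sketch the axis description (the bin list) is kept separate from the matrix values, so the same value list can be interpreted against genomic coordinates without storing empty cells. Compression, indexing and random access, which the abstract attributes to the HDF5-based format, are outside the scope of this toy example.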
