D4M and Large Array Databases for Management and Analysis of Large Biomedical Imaging Data

Advances in medical imaging technologies have enabled the acquisition of increasingly large datasets. State-of-the-art confocal and multi-photon imaging can produce biomedical datasets in excess of 1 TB each. Typical approaches to analyzing such datasets either downsample the original data or use distributed computing resources to process small subsets of images independently. Both approaches place significant overhead on the programmer, who must load the desired sub-volume from an array of image files into memory. Databases are well suited to indexing and retrieving components of very large datasets and show significant promise for the analysis of 3D volumetric images. In particular, array-based databases such as SciDB use an architecture that supports massively parallel processing while also providing database services such as data management and fast parallel queries. In this paper, we present a new set of tools that leverage the D4M (Dynamic Distributed Dimensional Data Model) toolbox for analyzing giga-voxel biomedical datasets. By combining SciDB and the D4M toolbox, we demonstrate that large volumetric data can be accessed and large-scale bioinformatics analytics performed efficiently and interactively. We show that an ingest rate of 2.8 million entries per second can be achieved when importing large datasets into SciDB. These tools provide more efficient ways to access random sub-volumes of massive datasets and to process data that typically cannot be loaded into memory. This work describes the D4M and SciDB tools that we developed and presents initial performance results.
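
To make the sub-volume access and ingest-rate claims concrete, the following is a minimal Python sketch, not the D4M or SciDB API. It assumes a volume stored as a regular grid of fixed-size chunks (the chunk shape and the function names `chunks_for_subvolume` and `ingest_seconds` are hypothetical) and assumes one database entry per voxel. It shows which chunks a sub-volume request would touch and how long an ingest would take at the reported 2.8 million entries per second.

```python
from itertools import product


def chunks_for_subvolume(start, stop, chunk_shape):
    """Return grid indices of the chunks overlapped by the half-open box [start, stop).

    The regular chunk grid is an illustrative assumption, not SciDB's
    actual storage layout.
    """
    ranges = []
    for lo, hi, size in zip(start, stop, chunk_shape):
        if hi <= lo:
            return []  # empty request along this axis
        # First and last chunk index touched along this axis.
        ranges.append(range(lo // size, (hi - 1) // size + 1))
    return list(product(*ranges))


def ingest_seconds(num_entries, entries_per_second=2.8e6):
    """Estimate ingest time at a fixed rate (default: the rate reported above)."""
    return num_entries / entries_per_second


if __name__ == "__main__":
    # A small box straddling a chunk boundary along the first axis.
    print(chunks_for_subvolume((250, 0, 0), (260, 10, 10), (256, 256, 256)))
    # Assumption: one entry per voxel, for a volume of 10**9 voxels.
    print(round(ingest_seconds(10**9)), "seconds")
```

Under these assumptions, only the chunks that intersect the requested box need to be read, which is why a chunked array store can serve random sub-volumes without loading the whole volume, and a volume of 10^9 voxels would take roughly six minutes to ingest at the reported rate.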
