Bisque: Advances in Bioimage Databases

Biological image databases have quickly replaced the personal media collections of individual scientists. Such databases permit objective comparisons, benchmarking, and data-driven science. As these collections have grown through the use of advanced (and automated) imaging tools and microscopes, scientists need high-throughput, large-scale statistical analysis of the data. Traditional databases and standalone analysis tools are not suited to image-based scientific endeavors because of the subjectivity, non-uniformity, and uncertainty of the primary data and their analyses. This paper describes our image-database platform Bisque, which combines flexible data structuring, uncertain data management, and high-throughput analysis. In particular, we examine: (i) managing scientific images and metadata for experimental science, where the data model may change from experiment to experiment; (ii) easily provisioning high-throughput, large-scale image analysis using cluster/cloud resources; (iii) managing uncertainty in measurement and analysis so that important aspects of the data are not prematurely filtered.

1 Challenges for Bioimage Researchers

Current research in biology is increasingly dependent on conceptual and quantitative approaches from the information sciences, ranging from theory through models to computational tools [6]. The ready availability of new microscopes and imaging techniques has produced vast amounts of multi-dimensional images and metadata. The introduction of new models, measurements, and methods has produced a wealth of data using image-based evidence [24]. Two notable examples of image-based studies are cellular Alzheimer’s studies and plant genetics.

In a recent Alzheimer’s study, the ability to reliably detect nuclei in three dimensions was critical to quantitative analysis [25]. Nuclei detection is also useful in a wide range of other applications, such as accurately determining how an organism is perturbed by genetic mutations, drug treatment, or injury. Additionally, nuclei centroid locations can be used for further analysis, such as cell membrane segmentation, or to initialize and validate computational models of cells and their patterns.

In the plant domain, technologies for quantifying plant development are underdeveloped relative to technologies for studying and altering genomes. As a result, information about plant gene function inherent in mutant phenotypes or natural genetic variation remains hidden. For example, plant scientists are trying to uncover gene function by studying seedling growth and development using high-throughput image analysis [19].

In both cases, researchers who depend on images as experimental evidence face the daunting task of managing, analyzing, and sharing images, in addition to gaining and providing access to analysis methods and results [1].

Copyright 2012 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

[1] Jennifer Widom, et al. Trio: A System for Integrated Management of Data, Accuracy, and Lineage, 2004, CIDR.

[2] B. S. Manjunath, et al. Silencing of CDK5 Reduces Neurofibrillary Tangles in Transgenic Alzheimer's Mice, 2010, The Journal of Neuroscience.

[3] Anne E Carpenter. Software opens the door to quantitative imaging, 2007, Nature Methods.

[4] Mohamed A. Soliman, et al. Top-k Query Processing in Uncertain Databases, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[5] Christian Böhm, et al. The Gauss-Tree: Efficient Object Identification in Databases of Probabilistic Feature Vectors, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[6] Dan Olteanu, et al. 10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information, 2007, ICDE.

[7] Carlo Tomasi, et al. Good features to track, 1994, 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[8] Jennifer Widom, et al. ULDBs: databases with uncertainty and lineage, 2006, VLDB.

[9] Hanchuan Peng, et al. Bioimage informatics: a new area of engineering biology, 2008, Bioinform.

[10] Ronald M. Summers, et al. Sharing images, 2003, Nature Methods.

[11] Jennifer Widom, et al. Working Models for Uncertain Data, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[12] Christopher Ré, et al. Efficient Top-k Query Evaluation on Probabilistic Data, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[13] Laks V. S. Lakshmanan, et al. ProbView: a flexible probabilistic database system, 1997, TODS.

[14] Hector Garcia-Molina, et al. The Management of Probabilistic Data, 1992, IEEE Trans. Knowl. Data Eng.

[15] Erik Brauner, et al. Informatics and Quantitative Analysis in Biological Imaging, 2003, Science.

[16] Jason R Swedlow, et al. To 5D and Beyond: Quantitative Fluorescence Microscopy in the Postgenomic Era, 2002, Traffic.

[17] Amir H Assadi, et al. Detection of a Gravitropism Phenotype in glutamate receptor-like 3.3 Mutants of Arabidopsis thaliana Using Machine Vision and Computation, 2010, Genetics.

[18] Bin Jiang, et al. Probabilistic Skylines on Uncertain Data, 2007, VLDB.

[19] Cesare Pautasso, et al. Restful web services vs. "big" web services: making the right architectural decision, 2008, WWW.

[20] Ambuj K. Singh, et al. Bisque: a platform for bioimage analysis and management, 2009, Bioinform.

[21] Yufei Tao, et al. Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions, 2005, VLDB.

[22] Dan Suciu, et al. Efficient query evaluation on probabilistic databases, 2004, The VLDB Journal.

[23] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome, 2004.

[24] B. S. Manjunath, et al. The iPlant Collaborative: Cyberinfrastructure for Plant Biology, 2011, Front. Plant Sci.

[25] Sumit Sarkar, et al. A probabilistic relational model and algebra, 1996, TODS.

[26] Jeffrey Scott Vitter, et al. Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data, 2004, VLDB.

[27] Douglas Thain, et al. Distributed computing in practice: the Condor experience, 2005, Concurr. Pract. Exp.

[28] Ilya G. Goldberg, et al. Modelling data across labs, genomes, space and time, 2006, Nature Cell Biology.

[29] Prithviraj Sen, et al. Representing and Querying Correlated Tuples in Probabilistic Databases, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[30] J. Bonfield, et al. Finishing the euchromatic sequence of the human genome, 2004, Nature.

[31] Roger Brent, et al. A partnership between biology and engineering, 2004, Nature Biotechnology.