GADBMS: A Framework for Scalable Array Analytics

With advances in technology, the scientific and data mining communities are producing an increasing amount of complex data. This data can be stored in multidimensional arrays and can reach the petabyte range. At this scale, an obvious solution is to distribute the data across many nodes and process it in parallel. However, optimizing storage for both space and access, and optimizing in-memory execution, is not straightforward. Array Database Management Systems (ADBMS) can be used to store these large datasets. This position paper presents GADBMS (Global-arrays Array Database Management System), an ADBMS built on the Global Arrays framework that allows users in both the scientific and data mining communities to efficiently store, access, and operate on large datasets in an easy-to-use framework.
