Cloud Kotta: Enabling secure and scalable data analytics in the cloud

Distributed communities of researchers increasingly rely on valuable, proprietary, or sensitive datasets. Such data is growing rapidly, especially in fields new to data-driven research such as the social sciences and humanities, and it is often governed by strict and complex data-use agreements. As a result, many research communities now require methods for secure, scalable, and cost-effective storage and analysis. Here we present Cloud Kotta, a cloud-based data management and analytics framework. Cloud Kotta offers an end-to-end solution for coordinating secure access to large datasets, together with an execution model that provides both automated infrastructure scaling and support for executing analytics near the data. Cloud Kotta implements a fine-grained security model that ensures only authorized users may access, analyze, and download protected data. It also implements automated methods for acquiring and configuring low-cost storage and compute resources as they are needed. We present the architecture and implementation of Cloud Kotta and demonstrate the advantages it provides in performance and flexibility. We show that Cloud Kotta's elastic provisioning model can reduce costs by up to 16x compared with statically provisioned models.
