DataLab: A Version Data Management and Analytics System

One challenge in big data analytics is the lack of tools to manage the complex interactions among code, data, and parameters, especially in the common situation where all of these can change frequently. We present our preliminary experience with DataLab, a system we built to manage big data workflows. DataLab improves the big data analytics workflow in several novel ways: 1) DataLab manages revisions of both code and data in a single coherent system and includes a distributed execution engine to run users' code; 2) DataLab tracks all analytics results in a data workflow graph and can compare the code and results between any two versions, making it easier for users to see the effects of their code changes; 3) DataLab provides an efficient data management layer that separates data from metadata, enabling efficient preprocessing filters; and 4) DataLab exposes a common API on which different applications can be built. We also report our experience applying a DataLab prototype to a real bioinformatics application.
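The abstract describes DataLab's capabilities (versioned code and data, a workflow graph, version comparison, a common API) without detailing the interface. Purely as an illustration of the kind of workflow this enables, the following minimal Python sketch mimics a versioned analytics client; every name here (DataLabClient, commit, run, diff) is a hypothetical placeholder, not DataLab's actual API.

```python
# Hypothetical sketch only: the abstract does not specify DataLab's API, so all
# names below (DataLabClient, commit, run, diff) are illustrative assumptions,
# not the interface described in the paper.

class DataLabClient:
    """Toy stand-in for a versioned code + data workflow manager."""

    def __init__(self):
        self.versions = []   # list of (code, data, params) snapshots
        self.results = {}    # version id -> computed result

    def commit(self, code, data, params):
        """Record a new snapshot of code, data, and parameters."""
        self.versions.append({"code": code, "data": data, "params": params})
        return len(self.versions) - 1   # version id

    def run(self, version_id, executor):
        """Execute the committed code on its data (executor is user-supplied)."""
        v = self.versions[version_id]
        self.results[version_id] = executor(v["code"], v["data"], v["params"])
        return self.results[version_id]

    def diff(self, v1, v2):
        """Compare the inputs and results of two versions."""
        return {
            "code_changed": self.versions[v1]["code"] != self.versions[v2]["code"],
            "params_changed": self.versions[v1]["params"] != self.versions[v2]["params"],
            "result_delta": (self.results.get(v1), self.results.get(v2)),
        }


# Example: the same pipeline committed with two parameter settings, then diffed.
def executor(code, data, params):
    return sum(x * params["scale"] for x in data)

lab = DataLabClient()
v1 = lab.commit(code="sum_pipeline", data=[1, 2, 3], params={"scale": 1})
v2 = lab.commit(code="sum_pipeline", data=[1, 2, 3], params={"scale": 2})
lab.run(v1, executor)
lab.run(v2, executor)
print(lab.diff(v1, v2))
```

The point of the sketch is the workflow shape rather than any specific implementation: every run is tied to an immutable snapshot of its inputs, so comparing two versions shows both what changed (code or parameters) and how the results differ.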
