DataLab: Introducing Software Engineering Thinking into Data Science Education at Scale

Data science education is a new area in computer science that has attracted increasing attention in recent years. However, data science educators currently lack good tools and methodologies. In particular, they lack integrated tools through which their students can acquire hands-on software engineering experience. To address these problems, we designed and implemented DataLab, a web-based tool for data science education that integrates code, data, and execution management into one system. The goal of DataLab is to provide a hands-on online lab environment that trains students in basic software engineering thinking and habits while maintaining a focus on core data science content. In this paper, we present the user-experience design and system-level implementation of DataLab. Further, we evaluate DataLab's performance through an in-classroom use case. Finally, using objective log-based learning behavior analysis and a subjective survey, we demonstrate DataLab's effectiveness.
