A First Course in Data Science

Abstract: Data science is a discipline that provides principles, methodology, and guidelines for analyzing data to yield tools, value, or insights. Driven by strong workforce demand, many academic institutions have started to offer degrees in data science, many at the graduate level and a few at the undergraduate level. Curricula differ across institutions because of varying levels of faculty expertise and because different disciplines (such as mathematics, computer science, and business) take the lead in developing the curriculum. The University of Massachusetts Dartmouth began offering degree programs in data science in Fall 2015, at both the undergraduate and the graduate level. Quite a few articles have been published on graduate data science courses, but far fewer on undergraduate ones. Our discussion focuses on undergraduate course structure and function, and specifically on a first course in data science. Our design of this course centers on a concept called the data science life cycle. That is, we view the tasks or steps in the practice of data science as forming a process, consisting of states that indicate how the process comes to life and how different tasks depend on or interact with one another until the birth of a data product or a conclusion. Naturally, the different pieces of the data science life cycle then form the individual parts of the course. The details of each piece are filled in with concepts, techniques, or skills that are popular in industry. Consequently, the design of our course is both “principled” and practical. A significant feature of our course philosophy is that, in line with activity theory, the course is based on the use of tools to transform real data to answer strongly motivated questions related to the data.
