Discovery in the Context of Educational Data Mining: An Inductive Approach

Abstract

Automated learning environments collect large amounts of information on the activities of their students. Unfortunately, analyzing and interpreting these data manually is tedious and requires substantial training and skill. Although automatic techniques do exist for mining data, their results are often hard to interpret or to incorporate into existing scientific theories of learning and education. We therefore present a model for performing automatic scientific discovery in the context of human learning and education. We demonstrate, using empirical results relating the frequency of student self-assessments to quiz performance, that our framework and techniques yield results better than those obtained with human-crafted features.

Introduction and Previous Work

One of the fundamental goals of scientific research is the modeling of phenomena. In particular, we are interested in examining the data produced by students using an on-line course. Intuitively, researchers believe that many interesting, and potentially useful, trends and patterns are contained in these logs. Researchers usually begin with some general idea of the phenomenon they would like to understand, and then proceed to collect observations of it. A scientist, for example, might have some prior belief, based on existing scientific theory and intuition, that the amount of time a student spends reading course notes will affect his performance on quizzes, but may not be able to specify exactly what he means by “time reading notes.” Is it the cumulative number of minutes, split into any number of sessions, conducted under any conditions, prior to the evaluation that matters? Or is the intensity of the reading more important? Is it better to read the notes right before the quiz, for higher recall, or does an earlier viewing help prime the student for learning? Unfortunately, researchers have neither the time nor the patience to go through all these logs by hand to find ideal instantiations of their features.

In this work, we develop a partial solution to this problem of feature discovery that uses a computer to intelligently induce higher-level features from low-level data. Although computers can produce copious log data, the unstructured, low-level nature of these data makes it difficult to design an algorithm that can construct features and models that the researcher and his community are interested in and can understand. In fact, the complexity of a full search of the feature space, from a statistical point of view, would depend on the size of the sufficient statistics of the entire data set; for all real-world problems, brute-force search is therefore intractable. A more insidious problem, however, is that even if the space of features could be enumerated and searched efficiently, the number of possible models based on those features would be even larger, and any attempt at learning a true model would suffer from overfitting and the curse of dimensionality. Although techniques do exist for addressing this issue, many do not take the semantics of the features into consideration, relying instead on an estimation of complexity. It turns out that by carefully limiting the types of features that we can represent and search, we reduce our search and overfitting problems without, hopefully, cutting out too many expressive features.
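To make the space of candidate instantiations concrete, the sketch below shows a few of the ways the informal notion of “time reading notes” could be turned into a measurable feature. The data structure, field names, and weighting scheme are illustrative assumptions of ours, not the representation actually used by the system described in this paper.

```python
# Hypothetical sketch: several competing instantiations of a single
# "time reading notes" feature, computed from low-level log events.
# Field names and weighting choices are illustrative assumptions only.

from dataclasses import dataclass
from typing import List


@dataclass
class ReadingSession:
    start_minutes_before_quiz: float  # when the session began, relative to the quiz
    duration_minutes: float           # how long the student spent reading


def total_reading_time(sessions: List[ReadingSession]) -> float:
    """Cumulative minutes, regardless of when or how the reading happened."""
    return sum(s.duration_minutes for s in sessions)


def reading_intensity(sessions: List[ReadingSession]) -> float:
    """Average minutes per session: a crude proxy for 'intensity' of reading."""
    return total_reading_time(sessions) / len(sessions) if sessions else 0.0


def recency_weighted_time(sessions: List[ReadingSession], half_life: float = 60.0) -> float:
    """Minutes discounted by how long before the quiz they occurred, so that
    last-minute reviewing counts more than an early priming read."""
    return sum(
        s.duration_minutes * 0.5 ** (s.start_minutes_before_quiz / half_life)
        for s in sessions
    )
```

Each function above is a single point in the space of instantiations that a researcher would otherwise have to craft and compare by hand; the approach described below searches such a space automatically.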
Certain techniques do exist for addressing feature selection. Principal component analysis (PCA) (e.g., Schölkopf, Smola, and Müller 1998), for example, finds a projection of the data from a higher-dimensional space to one with fewer dimensions. This projection reduces the number of features needed to represent the data. Unfortunately, it distorts the original, presumably intuitive, definitions of the features into linear combinations of those features, and in the process much of the interpretability of the resulting models is sacrificed. This weakness is present in many of the other methods used for feature selection and dimensionality reduction, such as clustering and kernel methods (Jain, Duin, and Mao 2000). All sacrifice interpretability, which has been shown to be essential if computational techniques are ever to have a serious impact on the progress of scientific research (Pazzani, Mani, and Shankle 2001). These are the problems this research tries to solve.

Figure 1. Feature creation flow diagram

It is important to note that we face two distinct problems, each of which is challenging in its own right. The first is defining and searching through the large space of possible features. The second is constraining that feature space and biasing the search to discover new features that improve predictiveness while still preserving the semantics of the original features. To achieve these goals, we begin with a relatively small set of core features, defined in terms of the raw data, and grow this set through an iterative process of feature creation, scoring, and pruning. At each iteration the predictiveness of the model based on the features is increased, while the scientific and semantic interpretability of the features themselves is hopefully preserved. Via this semi-greedy process of growing and pruning we are able to discover novel, non-intuitive features without the burden of a brute-force search.

System architecture (overview)

Figure 1 shows a high-level diagram of the cyclical flow of data through the feature creation process and provides the outline for the structure of the paper. The arrows represent processes, and the items in italics are the inputs and results of those processes. We begin with the initial features of the raw data, iteratively grow candidate features via prediction and calculation, and then prune these candidates down to create a new generation with which to begin the process again. Each completed cycle represents one iteration of the algorithm. The process continues until a user-defined stopping condition is met, e.g., elapsed computation time, the R² of the discovered model, or the number of features discovered.
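As a rough illustration of this cycle, the sketch below implements the grow, score, and prune loop with the three stopping conditions mentioned above. The helper functions, the choice of R² as the score, and the default thresholds are placeholders of ours; the actual feature-creation operators, model class, and scoring procedure are those described in the remainder of the paper.

```python
# Illustrative sketch of the cyclical process in Figure 1.
# The helpers (grow, prune, score_model) and the default thresholds are
# hypothetical placeholders, not the system's actual implementation.

import time


def discover_features(core_features, data, grow, prune, score_model,
                      max_seconds=3600.0, target_r2=0.9, max_features=50):
    """Iterate the feature-creation cycle until a user-defined stopping condition is met."""
    features = list(core_features)
    start = time.monotonic()
    while True:
        # Grow: derive new candidate features from the current generation.
        candidates = features + grow(features, data)

        # Prune: keep only the candidates that form the next generation.
        features = prune(candidates, data)

        # Score: predictiveness (here, R²) of a model built on the surviving features.
        r2 = score_model(features, data)

        # Stop on elapsed time, model quality, or number of features discovered.
        if (time.monotonic() - start > max_seconds
                or r2 >= target_r2
                or len(features) >= max_features):
            return features, r2
```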