Debugging the Evidence Chain

In education, as in many other fields, it is common to build complex systems that assess latent properties of individuals: the knowledge, skills, and abilities of students. Such systems usually consist of several processes: (1) a context determination process, which identifies or creates tasks, that is, contexts in which evidence can be gathered; (2) an evidence capture process, which records the work product produced by the student interacting with the task; (3) an evidence identification process, which extracts observable outcome variables believed to have evidentiary value; and (4) an evidence accumulation process, which integrates evidence across multiple tasks (contexts) and is often implemented with a Bayesian network. Flaws may be present in the conceptualization, requirements, or implementation of any of these processes. In later stages of development, bugs are usually associated with a particular task: tasks whose observable variables carry exceptionally high or unexpectedly low information may be problematic and merit further investigation. This paper identifies individuals with unexpectedly high or low scores and uses weight-of-evidence balance sheets to flag problematic tasks for follow-up. We illustrate these techniques with work on the game Newton's Playground, an educational game designed to assess a student's understanding of qualitative physics.
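A weight-of-evidence balance sheet tallies, observation by observation, how much each piece of evidence shifts belief toward or away from a hypothesis about the student (e.g., "the student is proficient"). The entry for each task is the log likelihood ratio W(H:E) = log[P(E|H)/P(E|not H)], conventionally reported in centibans. The sketch below is a minimal illustration, not the paper's implementation; the task names and likelihood values are hypothetical.

```python
import math

def weight_of_evidence(p_e_given_h, p_e_given_not_h):
    """W(H:E) = 100 * log10(P(E|H) / P(E|not H)), in centibans."""
    return 100 * math.log10(p_e_given_h / p_e_given_not_h)

# Hypothetical per-task likelihoods of the observed outcome under
# H = "student is proficient" vs. not-H.
observations = [
    ("task_01", 0.80, 0.30),  # outcome supports proficiency
    ("task_02", 0.55, 0.50),  # nearly uninformative
    ("task_03", 0.10, 0.60),  # surprising failure: large negative weight
]

# The balance sheet: per-task weight and running total.
total = 0.0
print(f"{'Task':10s}{'WOE (cb)':>10s}{'Running':>10s}")
for task, p_h, p_not_h in observations:
    w = weight_of_evidence(p_h, p_not_h)
    total += w
    print(f"{task:10s}{w:10.1f}{total:10.1f}")
```

A task like `task_03`, whose weight is large and runs against the rest of the evidence for many students, is exactly the kind of entry the balance sheet flags for follow-up debugging.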
