Understanding Recommender Systems: Experimental Evaluation Challenges

The paper discusses some significant issues in the empirical evaluation of interactive recommenders: the role of experiments, the contingent and constructive nature of users' interaction strategies, and the generalizability of the results. We propose to adopt, as the main evaluation goal, the construction of a situation-specific account of the user-system behavior, and we suggest applying the context matching approach to cope with the contingencies of the user behavior. To make clear the limitations of high-level, single-step experimental evaluations, we present a critical analysis of a case study, a pilot evaluation of a travel recommender system (ITR). The examination of this study shows the danger of overlooking the detailed aspects of user-system interaction, and underlines the need to iteratively refine the evaluation hypotheses and design when no detailed model of the user-system behavior is initially available.

1 Understanding Intelligent Interactive Systems

1.1 Current Evaluation Approaches

Several methods are used to evaluate the various components of interactive intelligent systems at different development stages. The accuracy and the performance of the algorithms are appraised through analytical approaches and through off-line empirical tests and simulations, following the tradition of artificial intelligence and machine learning. The interface components are usually tested through a set of HCI techniques, such as heuristic evaluations, cognitive walkthroughs, and verbal (or video) protocols. Specific system functions and interactive decision aids can be analyzed through laboratory experiments. Finally, the whole system is evaluated through experiments, questionnaires, and clickstream analysis (borrowing methods from the behavioral sciences and HCI). More recently, the set of techniques has been expanded with web experiments (mainly used for web 'field' studies) and cognitive modeling (used in early stages for interaction assessment and in later stages to generate quantitative behavioral predictions). Specific methods have been proposed to deal with adaptive systems (for instance, layered evaluation [1]).

The adoption of different evaluation methods in different stages seems to be a useful heuristic [2]. Nonetheless, each method has its own shortcomings. First, some HCI techniques are rather subjective and can provide only weak indications. Second, setting up well-designed laboratory experiments is quite complex and costly. Third, the off-line tests of the algorithms tell only part of the story: in some domains, tests with real users can show a quite different picture of the system's effectiveness, because of the impact of the GUI and of the users' behavior. Finally, the results of the experimental evaluations of specific decision aids are not necessarily generalizable to situations in which these aids are used within a real and complex system. Therefore, the divide-and-conquer approach does not necessarily guarantee sound evaluation results.

1.2 The Contingent and Constructive Nature of Interaction Behavior

Research on embodied and situated cognition has pointed out how specific environmental and physical constraints can structure and shape cognitive processes. Lave [3] was able to show that arithmetic reasoning during purchase decisions in the grocery store can be sharply different from reasoning in arithmetic test tasks, and she indicated how everyday cognition relies on environmental constraints.
The importance of relatively fine-grained details of information presentation for information seeking and choice has been demonstrated by behavioral research on information display, both in the laboratory [4] and in a real setting [5]. Among behavioral decision researchers, a shared view is that preferences are often constructed by the decision maker during the accomplishment of a task, and not merely retrieved from memory [6]. The process of preference construction can be deeply affected by many task and context factors, including the response mode (the way in which the preferences are elicited).

In the field of HCI, the interest in embodied and situated cognition is justified by the observation that user interface design can significantly influence cognition, changing the effort level associated with different strategies. Many factors can affect the cognitive strategies: the information acquisition mode and its cost [7], the implementation cost of the operators [8], the cost of error recovery [9], the explicit support for some types of strategy, the availability of a suitable external representation, the perceptual salience of the information, and the relative importance of accuracy maximization vs. effort minimization goals [10]. HCI research has also highlighted how even very simple interactive tasks (e.g., moving the mouse and pressing a button) can be performed using different microstrategies [11]. The selection of different microstrategies can be influenced in subtle ways by apparently minor changes in the interface design, and can produce significant time savings in routine interaction behaviors.

Understanding the user-system interaction is a very complex problem if the system is equipped with some kind of user model. The system interface can be considered as the user's window on the system, and it is able to affect her/his representation through what it makes available and the feedback it delivers [12]. At the same time, the user interface is the system's window on the user, and it affects the system's user model via the information gathered and the input collection mode.

1.3 Toward a Situated Approach

Given the complexity of interactive intelligent systems and the contingent and constructive nature of cognition, it is very difficult to properly evaluate the user-system behavior. This behavior can be affected by many factors, and it is practically infeasible to manipulate each relevant variable in the experiments. Moreover, it cannot be assumed that the user's behavior will be static: any slight change to the interface or to the system can modify the user's interaction and choice strategies. From our viewpoint, evaluating a system means trying to understand its behavior as the result of a complex interplay between its functions, the users' strategies, and the specific aspects of the interface. Therefore, the main goal of our evaluation efforts should be the definition and test of a detailed, situation-specific account of the user-system behavior. This means that our evaluations will have a rather narrow generalization extent, and that we should carry out detailed interaction analyses. Specific evaluation techniques (see sub-section 1.1) should be carefully applied in preliminary tests, before setting up the experimental evaluations of the final system. These early tests can give us some confidence in the proper behavior of some system components, but they do not guarantee that the real system will work properly.
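To make the notion of a detailed interaction analysis more concrete, the sketch below shows one possible way to derive simple per-session measures (number of queries, query refinements, distinct items inspected, time to the final selection) from timestamped interaction logs. The log format, the event names, and the measures are hypothetical choices made only for illustration; they are not taken from any specific system discussed in this paper.

```python
from dataclasses import dataclass
from collections import defaultdict
from typing import Dict, List

# Hypothetical log record: one timestamped user-system event per row, with
# event names such as "query", "refine_query", "view_item", and "confirm".
@dataclass
class Event:
    session_id: str
    timestamp: float   # seconds since the start of the session
    action: str
    target: str = ""   # e.g. an item identifier, empty when not applicable

def session_measures(events: List[Event]) -> Dict[str, Dict[str, float]]:
    """Aggregate simple per-session interaction measures from raw event logs."""
    by_session: Dict[str, List[Event]] = defaultdict(list)
    for ev in events:
        by_session[ev.session_id].append(ev)

    measures: Dict[str, Dict[str, float]] = {}
    for sid, evs in by_session.items():
        evs.sort(key=lambda e: e.timestamp)
        measures[sid] = {
            "n_queries": sum(e.action == "query" for e in evs),
            "n_refinements": sum(e.action == "refine_query" for e in evs),
            "n_items_viewed": len({e.target for e in evs if e.action == "view_item"}),
            "time_to_decision": next(
                (e.timestamp for e in evs if e.action == "confirm"), float("nan")
            ),
        }
    return measures
```

Measures of this kind can then be inspected session by session, or compared across experimental conditions, to spot sub-optimal interaction behaviors (for example, sessions with many query refinements but very few item inspections).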
For the experimental evaluation of the real system, we suggest adopting the context matching approach to cope with the contingencies of the user's behavior [6]. For recommender systems, this means that it will be necessary to reproduce the real decision environment in the experimental setting. Therefore, the real system should be tested, with no changes in the available information and databases, in the interface, in the support tools, or in the algorithms and parameters. Furthermore, to ensure external validity, the evaluation setting and the sample of users should also be selected to be representative. In this way, it will be possible to manipulate, in a principled way, only a few relevant factors, in order to understand their impact on the user-system behavior in the real decision environment. This approach will allow us to formulate correct predictions within the generalization extent of the evaluation.

2 Evaluation of an Interactive Case-Based Recommender

To make clear the limitations of high-level, single-step experimental evaluations, we will present a critical analysis of a case study, a pilot evaluation of our Intelligent Travel Recommendation system (henceforth ITR [13]). This preliminary evaluation was carried out to get some general indications about the system's performance and interface, given that the system was still evolving, the GUI was at a prototypical stage, and the case base was not very large. Furthermore, we did not have any empirically supported model to guide us. Therefore, this basic experiment can be considered representative of a typical early evaluation study. The examination of this pilot test shows the danger of overlooking the detailed aspects of user-system interaction, and underlines the need to iteratively refine the evaluation hypotheses and design when no detailed model of the user-system behavior is initially available. We will show that a detailed analysis of the log data, focussed on the user-system interaction, was able to highlight a series of problems and to identify some sub-optimal interaction behaviors. This analysis suggested potential explanations, new hypotheses, and some methodological changes.

2.1 Problem and System Description

The main purposes of recommender systems are to suggest interesting products and to provide information support for consumers' decision processes [14]. These systems are typically embedded in e-commerce web services, and they take into account the user's needs and preferences in order to propose a suitable set of products, relying on specific algorithms and on the knowledge already acquired by the system. Recommender system research is therefore mainly focussed on the issues of information overload, lack of knowledge, trade-off optimization, and interaction cost minimization. The two main recommendation approaches are collaborative filtering and content-based recommendation [15]. We have designed a novel hybrid coll
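As a purely illustrative aside, and not a description of the actual ITR algorithms, the following minimal sketch shows the kind of content-based scoring mentioned just above: candidate products are ranked by a weighted similarity between their features and the user's stated preferences. The feature names, the weights, and the catalog are hypothetical.

```python
from typing import Dict, List, Tuple

def weighted_similarity(prefs: Dict[str, float],
                        item: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Similarity = 1 - weighted mean absolute distance over the features."""
    total_weight = sum(weights.values()) or 1.0
    distance = sum(w * abs(prefs.get(f, 0.0) - item.get(f, 0.0))
                   for f, w in weights.items()) / total_weight
    return 1.0 - distance

def recommend(prefs: Dict[str, float],
              catalog: Dict[str, Dict[str, float]],
              weights: Dict[str, float],
              k: int = 3) -> List[Tuple[str, float]]:
    """Rank catalog items by similarity to the preference profile, keep the top k."""
    scored = [(item_id, weighted_similarity(prefs, features, weights))
              for item_id, features in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up travel-like features, all normalized to [0, 1].
catalog = {
    "seaside_resort": {"budget": 0.6, "culture": 0.2, "nature": 0.7},
    "city_break":     {"budget": 0.4, "culture": 0.9, "nature": 0.1},
}
prefs = {"budget": 0.5, "culture": 0.8, "nature": 0.3}
weights = {"budget": 1.0, "culture": 2.0, "nature": 1.0}

print(recommend(prefs, catalog, weights, k=2))
```

A collaborative-filtering variant would instead score items using the ratings of users with similar profiles, and hybrid recommenders combine the two kinds of evidence.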

[1] John W. Payne et al. Measuring Constructed Preferences: Towards a Building Code, 1999.

[2] Demetrios G. Sampson et al. Layered Evaluation of Adaptive Applications and Services, 2000, AH.

[3] Gerald L. Lohse et al. A Comparison of Two Process Tracing Methods for Choice Tasks, Organizational Behavior and Human Decision Processes, 1996.

[4] Jakob Nielsen et al. Usability Engineering, 1997, The Computer Science and Engineering Handbook.

[5] S. Payne et al. The Effects of Operator Implementation Cost on Planfulness of Problem Solving and Learning, 1998, Cognitive Psychology.

[6] Danilo Fum et al. Adaptive Selection of Problem Solving Strategies, 2001.

[7] Robin Burke et al. Knowledge-Based Recommender Systems, 2000.

[8] Kenton O'Hara et al. Planning and the User Interface: The Effects of Lockout Time and Error Recovery Cost, 1999, Int. J. Hum. Comput. Stud.

[9] J. E. Russo et al. The Value of Unit Price Information, 1977.

[10] Gerhard Fischer et al. User Modeling in Human–Computer Interaction, 2001, User Modeling and User-Adapted Interaction.

[11] John Riedl et al. E-Commerce Recommendation Applications, 2004, Data Mining and Knowledge Discovery.

[12] Francesco Ricci et al. ITR: A Case-Based Travel Advisory System, 2002, ECCBR.

[13] Wayne D. Gray et al. Milliseconds Matter: An Introduction to Microstrategies and to Their Use in Describing and Predicting Interactive Behavior, 2000.

[14] L. Quéré. Cognition in Practice, 1996.

[15] Don N. Kleinmuntz et al. Information Displays and Choice Processes: Differential Effects of Organization, Form, and Sequence, 1994.