Detecting session boundaries from Web user logs

Detecting session boundaries on the Web is important for several reasons. First, it establishes a common context for statistics relating to user sessions and the frequency of user activities. More specifically, boundaries are needed in order to group related information together for other applications, such as learning techniques for adaptive search engines. To date, however, the notion of a session on the Web has not been consistently defined, if at all. The tendency has been to treat all the log data available from one user or IP address as a single session, regardless of the length of time covered by the logs. This approach lacks a user-oriented view. We argue that a session on the Web can be defined as a group of user activities that are related to each other not only through an evolving information need but also through close proximity in time. We therefore describe and discuss an investigation of two Web transaction logs (Excite and Altavista), with a view to structuring the activities into sessions or units for subsequent use in user-oriented learning techniques. The paper describes the methodology and the experiments performed, followed by results and discussion. The results point to a threshold of 10-15 minutes between user activities as an appropriate session interval. The implications and limitations of the results, as well as differences from traditional IR systems, are also discussed.
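
To make the time-proximity part of this definition concrete, the following is a minimal sketch of how a log could be cut into sessions with a fixed inactivity threshold. It is not the procedure used in the experiments: the function name `split_sessions`, the `(user_id, timestamp, query)` record layout, and the sample log are assumptions made for illustration, and the 15-minute default is simply the upper end of the reported 10-15 minute range. The sketch considers only the gap between consecutive activities and does not model the evolving information need.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sessions(events, threshold=timedelta(minutes=15)):
    """Group (user_id, timestamp, query) records into per-user sessions.

    A new session starts whenever the gap between two consecutive
    activities of the same user exceeds `threshold` (an assumed value).
    """
    by_user = defaultdict(list)
    for user_id, timestamp, query in events:
        by_user[user_id].append((timestamp, query))

    sessions = []
    for user_id, activity in by_user.items():
        activity.sort(key=lambda item: item[0])  # order by timestamp
        current = [activity[0]]
        for previous, entry in zip(activity, activity[1:]):
            if entry[0] - previous[0] > threshold:
                # Gap too long: close the current session, open a new one.
                sessions.append((user_id, current))
                current = []
            current.append(entry)
        sessions.append((user_id, current))
    return sessions


# Hypothetical log records, invented for this example.
log = [
    ("u1", datetime(2000, 1, 1, 9, 0), "jaguar"),
    ("u1", datetime(2000, 1, 1, 9, 5), "jaguar car"),
    ("u1", datetime(2000, 1, 1, 11, 0), "weather"),
    ("u2", datetime(2000, 1, 1, 9, 2), "news"),
]
sessions = split_sessions(log)
```

With this sample log, the two queries from `u1` that are five minutes apart fall into one session, while the query issued nearly two hours later opens a new one. Treating the whole log of `u1` as a single session, as in the tendency criticised above, would merge all three.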