In recent years, research on novel types of human-computer interaction, for example multi-touch or tangible interfaces, has increased considerably. Although a large number of innovative applications have already been written based on these new input methods, they often have significant deficiencies from a developer’s point of view. Aspects such as configurability, portability and code reuse have been largely overlooked. A prime example of these problems is gesture recognition. Existing implementations are mostly tied to a certain hardware platform, tightly integrated into user interface libraries, and monolithic, with hard-coded gesture descriptions. Developers are therefore forced time and again to reimplement crucial application components. To address these drawbacks, we propose a clean separation between user interface and gesture recognition. In this paper, we present a widely applicable, generic specification of gestures which enables the implementation of a hardware-independent standalone gesture recognition engine for multitouch and tangible interaction. The goal is to allow the developer to focus on the user interface itself instead of on internal components of the application.

INTRODUCTION & RELATED WORK

More and more researchers and hobbyists have gained access to novel user interfaces in the last few years. Examples include tangible input devices or multitouch surfaces. Consequently, the number of applications being written for these systems is increasing steadily. However, from a developer’s point of view, most of these applications still have drawbacks. Core components such as gesture recognition are usually integrated so tightly with the rest of the application that they are nearly impossible to reuse in a different context. Additionally, most applications were designed to run on a single piece of hardware only. While this approach probably suited the original developers best, it impedes others who try to build on this work.
As a result, many applications for novel interactive devices are created from scratch, consuming valuable development time. For example, countless applications contain code which tries to detect widely used multi-finger gestures for scaling and rotation, which makes these gestures prime candidates for a more general approach. Of course, limiting any kind of generic gesture recognition to these few examples alone would not provide a great advantage, as many more gestures exist which the developer might want to include in a hypothetical interface. Therefore, the first step in a general approach to gesture recognition is to design a formal, extensible specification of gestures. From an abstract point of view, the goal is to separate the semantics of a gesture (the intent of the user) from its syntax (the motions executed by the user).

While many graphical toolkits such as Qt, GTK+, Swing, Aqua or the Windows User Interface API exist today, all of them were originally designed with common input devices such as mouse and keyboard in mind. To some extent, issues such as multi-point input, rotation independence or gesture recognition are being addressed in recent versions or extensions of these toolkits. Examples include DiamondSpin [8], the Microsoft Surface SDK or the support for multitouch input in Windows 7 and Mac OS X. Nevertheless, none of these libraries provides any separation between the syntax and semantics of gestures. Customizing an application on a per-user basis or adapting it to a different type of hardware therefore still requires significant changes to internal components of the library or of the application itself.

Some attempts have already been made to recognize gestures in the input stream as opposed to simply reacting to touch/release events. Several approaches based on DiamondTouch have been presented by Wu et al. [11, 10].
A common aspect of these systems is that gesture recognition is still performed inside the application itself. Some preliminary approaches to separate the recognition of gestures from the end-user part of the application exist [6, 3]. With the exception of Sparsh-UI [5], these systems are not yet beyond the design stage. Sparsh-UI also follows a layered approach with a separate gesture server that is able to recognize some fixed gestures for rotation, scaling etc. independently from the end-user application. However, while this is a step towards a more abstract view of gestures, the crucial aspect of gesture customization has not yet been addressed.

A FORMAL SPECIFICATION OF GESTURES

Before discussing the details of our approach, some prerequisites need to be described. We assume that the raw input data which is generated by the input hardware has already been transformed into an abstract representation such as the popular TUIO protocol [7]. We also assume that the location data delivered by this abstract protocol has been transformed into a common reference frame, e.g., screen coordinates. These assumptions serve to hide any hardware-related differences from the gesture recognizer. Below, we will refer to data generated by the hardware as input events. Usually, every input object (e.g., hand, finger or tangible object) generates one input event for every frame of sensor data in which it is present. More details on the layered software architecture on which this approach is based can be found in [2]. The formal specification which we will describe here forms the basis for a communications protocol. This protocol is used by the application to specify screen areas and the gestures which are to be recognized within them. Afterwards, the gesture recognizer will use the same protocol to notify the application when one of the previously specified gestures has been triggered by the users’ motions.
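The two assumptions about input events stated above can be illustrated with a minimal sketch. It assumes that the abstract protocol reports positions normalized to the range [0, 1] and that screen coordinates serve as the common reference frame. The names `InputEvent`, `to_screen` and `events_for_frame` are hypothetical and are not part of the specification described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    """One observation of one input object in one sensor frame (hypothetical layout)."""
    object_id: int   # stable ID of the finger, hand or tangible object
    kind: str        # e.g. "finger", "hand" or "tangible"
    frame: int       # number of the sensor frame that produced this event
    x: float         # position in the common reference frame (screen pixels)
    y: float


def to_screen(norm_x, norm_y, width, height):
    """Map normalized coordinates in [0, 1] to screen coordinates (assumed mapping)."""
    return norm_x * width, norm_y * height


def events_for_frame(frame, observations, width, height):
    """Create one InputEvent per input object present in this frame.

    observations: iterable of (object_id, kind, norm_x, norm_y) tuples.
    """
    events = []
    for object_id, kind, norm_x, norm_y in observations:
        x, y = to_screen(norm_x, norm_y, width, height)
        events.append(InputEvent(object_id, kind, frame, x, y))
    return events


# Two objects are visible in frame 7 of an 800x600 screen.
events = events_for_frame(
    7, [(1, "finger", 0.5, 0.25), (2, "tangible", 0.0, 1.0)], 800, 600
)
```

Because every event already carries screen coordinates, a recognizer consuming such events would not need to know which hardware produced them.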
Widgets and Event Handling

Before discussing the specification of gestures and events, we will briefly examine how widgets and events are handled in common mouse-based toolkits. Here, every widget which is part of the user interface corresponds to a window. While this term is mostly applied only to top-level application windows, every widget, however small, is associated with a window ID. In this context, a window is simply a rectangular, axis-aligned area in screen coordinates which is able to receive events and which can be nested within another window. Due to this parent-child relationship between windows, they are usually stored in a tree. When a new mouse event occurs at a specific location, this tree is traversed starting from the root window, which usually spans the entire screen. Each window is checked to determine whether it contains the event’s location and whether its filters match the event’s type. If both conditions are met, the check is repeated for the children of this window until the most deeply nested window which matches this event is found. The event is then delivered to the event handler of this window. This process is called event capture.

However, there are occasions where this window will not handle the event. One such occasion is a round button. Events which are located inside the rectangular window, but outside the circular button area itself, should be delivered to the parent instead. In this case, the button’s event handler will reject the event, thereby triggering a process called event bubbling. The event will now be successively delivered to all parent windows, starting with the direct parent, until one of them accepts and handles the event. Should the event reach the root of the tree without having been accepted by any window, it is discarded.

When we now compare this commonly used method to our approach, one fundamental difference is apparent.
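The capture and bubbling procedure of mouse-based toolkits described above can be sketched as follows. This is a simplified illustration and not the code of any particular toolkit: the classes `Event` and `Window` and the functions `capture` and `dispatch` are hypothetical, and a handler is assumed to return `True` when it accepts an event and `False` when it rejects it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Event:
    x: float
    y: float
    type: str


@dataclass(eq=False)
class Window:
    name: str
    x: float
    y: float
    width: float
    height: float
    accepted_types: Set[str]           # event filter of this window
    handler: Callable[[Event], bool]   # True = handled, False = rejected
    parent: Optional["Window"] = None
    children: List["Window"] = field(default_factory=list)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def matches(self, event):
        """Check the axis-aligned rectangle and the event type filter."""
        inside = (self.x <= event.x < self.x + self.width
                  and self.y <= event.y < self.y + self.height)
        return inside and event.type in self.accepted_types


def capture(root, event):
    """Event capture: find the most deeply nested window matching the event."""
    if not root.matches(event):
        return None
    current = root
    while True:
        for child in current.children:
            if child.matches(event):
                current = child
                break
        else:
            return current  # no child matches, so this is the deepest window


def dispatch(root, event):
    """Capture, then bubble upwards until a handler accepts the event.

    Returns the name of the accepting window, or None if the event is discarded.
    """
    window = capture(root, event)
    while window is not None:
        if window.handler(event):
            return window.name
        window = window.parent  # event bubbling
    return None


def round_button_handler(event):
    """Accept only events inside a circle of radius 10 centered at (20, 20)."""
    return (event.x - 20) ** 2 + (event.y - 20) ** 2 <= 10 ** 2


root = Window("root", 0, 0, 100, 100, {"click"}, lambda e: False)
panel = root.add_child(Window("panel", 0, 0, 50, 50, {"click"}, lambda e: True))
button = panel.add_child(
    Window("button", 10, 10, 20, 20, {"click"}, round_button_handler)
)
```

In this example, a click at (20, 20) is handled by the button, whereas a click at (11, 11) lies inside the button's rectangle but outside its circular area, so it bubbles up to the panel. A click at (80, 80) only matches the root window, whose handler rejects it, and is therefore discarded.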
Instead of one single class of event, we are dealing with two semantically different kinds of events. The first class is comprised of input events which describe raw

[Figure: hierarchy of regions, gestures and features. Gesture labels: "move", "tap", "rotate", "spin". Feature labels: Motion, BlobCount, Rotation, Scale, ...]
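The terms Region, Gesture and Feature that appear above suggest a three-level hierarchy in which regions contain gestures and gestures are composed of features. The sketch below shows one possible in-memory representation of such a hierarchy. All field names, the polygon outline and the pairing of gestures with features are illustrative assumptions and are not taken from the specification itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    name: str   # e.g. "Motion", "BlobCount", "Rotation" or "Scale"


@dataclass
class Gesture:
    name: str   # e.g. "move", "tap", "rotate" or "spin"
    features: List[Feature] = field(default_factory=list)


@dataclass
class Region:
    # Outline of the screen area in screen coordinates (hypothetical field).
    outline: List[Tuple[float, float]]
    gestures: List[Gesture] = field(default_factory=list)

    def gesture_names(self):
        return [gesture.name for gesture in self.gestures]


# Illustrative pairing only: which features make up which gesture is assumed here.
region = Region(
    outline=[(0, 0), (200, 0), (200, 100), (0, 100)],
    gestures=[
        Gesture("move", [Feature("Motion")]),
        Gesture("tap", [Feature("BlobCount")]),
    ],
)
```

Keeping the gesture names separate from the features that define them mirrors the separation between the semantics and the syntax of a gesture that is argued for in the introduction.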
[1] Hiroshi Ishii et al. Bricks: laying the foundations for graspable user interfaces. CHI '95, 1995.
[2] Gudrun Klinker et al. A multitouch software architecture. NordiCHI, 2008.
[3] Mike Wu et al. Gesture registration, relaxation, and reuse for multi-point direct-touch surfaces. First IEEE International Workshop on Horizontal Interactive Human-Computer Systems (TABLETOP '06), 2006.
[4] Yang Li et al. Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes. UIST, 2007.
[5] Enrico Costanza et al. TUIO: A Protocol for Table-Top Tangible User Interfaces. 2005.
[6] Meredith Ringel Morris et al. DiamondSpin: an extensible toolkit for around-the-table interaction. CHI, 2004.
[7] Mike Wu et al. Multi-finger and whole hand gestural interaction techniques for multi-user tabletop displays. UIST '03, 2003.