Computer Vision Techniques for Man-Machine Interaction

Computer vision offers many new possibilities for making machines aware of and responsive to man. Manipulating objects is a natural means by which man operates on and shapes his environment. By tracking the hands of a person manipulating objects, computer vision allows any convenient object, including the fingers, to be used as a computer input device. In the first part of this paper we describe experiments with techniques for watching the hands and recognizing gestures.

Vision of the face is an important aspect of human-to-human communication. Computer vision makes it possible to track and recognize faces, detect speech acts, estimate focus of attention and recognize emotions through facial expressions. When combined with real-time image processing and active control of camera parameters, these techniques can greatly reduce the communications bandwidth required for video-phone and video-conference communications. In the second part of this paper we describe techniques for detecting, tracking, and interpreting faces.

By following the movements of humans within a room or building, computer vision makes it possible for communications and access to information technology to follow the user automatically. In the final section we mention work on detecting and tracking full-body motion.

All of these systems use relatively simple techniques to describe the appearance of people in images. Such techniques are easily programmed to operate in real time on widely available personal computers. Each of the techniques has been integrated into a continuously operating system using a reactive architecture.

1 Looking at people: perception for man-machine interaction

One of the effects of the continued exponential growth in available computing power has been an exponential decrease in the cost of hardware for real-time computer vision. This trend has been accelerated by the recent integration of image acquisition and processing hardware for multi-media applications in personal computers. Lowered cost has meant more widespread experimentation in real-time computer vision, creating a rapid evolution in robustness and reliability and the development of architectures for integrated vision systems [Crowley et al 1994].

Man-machine interaction provides a fertile application domain for this technological evolution. The barrier between physical objects (paper, pencils, calculators) and their electronic counterparts limits both the integration of computing into human tasks and the population willing to adapt to the required input devices. Computer vision, coupled with video projection using low-cost devices, makes it possible for a human to use any convenient object, including the fingers, as a digital input device. Computer vision can also permit a machine to track, identify and watch the face of a user. This offers the possibility of reducing bandwidth for video-telephone applications, of following the attention of a user by tracking his fixation point, and of exploiting facial expression as an additional information channel between man and machine.

Traditional computer vision techniques have been oriented toward using contrast contours (edges) to describe polyhedral objects. This approach has proved fragile even for man-made objects in a laboratory environment, and inappropriate for watching deformable non-polyhedral objects such as hands and faces. Thus man-machine interaction requires computer vision scientists to "go back to basics" to design techniques adapted to the problem.
The following sections describe experiments with techniques for watching hands and faces.

2 Looking at hands: gesture as an input device

Human gesture serves three functional roles [Cadoz 94]: semiotic, ergotic, and epistemic. The semiotic function of gesture is to communicate meaningful information. The structure of a semiotic gesture is conventional and commonly results from shared cultural experience. The good-bye gesture, American sign language, the operational gestures used to guide airplanes on the ground, and even the vulgar "finger" each illustrate the semiotic function of gesture. The ergotic function of gesture is associated with the notion of work. It corresponds to the capacity of humans to manipulate the real world, to create artifacts, or to change the state of the environment by "direct manipulation". Shaping pottery from clay and wiping dust are examples of ergotic gestures. The epistemic function of gesture allows humans to learn from the environment through tactile experience. By moving your hand over an object, you appreciate its structure and may discover the material it is made of, as well as other properties.

All three functions may be augmented using an instrument or tool. Examples include a handkerchief for the semiotic good-bye gesture, a turn-table for the ergotic shaping of pottery, or a dedicated artifact for exploring the world.

In Human-Computer Interaction, gesture has been primarily exploited for its ergotic function: typing on a keyboard, moving a mouse and clicking buttons. The epistemic role of gesture has emerged effectively from pen computing and virtual reality: ergotic gestures applied to an electronic pen, to a data-glove or to a body-suit are transformed into meaningful expressions for the computer system. Special-purpose interaction languages have been defined, typically 2-D pen gestures as in the Apple Newton, or 3-D hand gestures to navigate in virtual spaces or to control objects remotely. Mice, data-gloves, and body-suits are "artificial add-ons" that tether the user to the computer. They are not real end-user instruments (as a hammer would be), but convenient tricks for computer scientists to sense human gesture.

Computer vision can transform ordinary artifacts and even body parts into effective input devices. We are exploring the integration of appearance-based computer vision techniques to observe human gesture non-intrusively, in a fast and robust manner. As a working problem, we are studying such techniques in the context of a digital desk [Wellner et al 93]. A digital desk, illustrated in figure 1, combines a projected computer screen with a real physical desk. The projection is easily obtained using a liquid-crystal "datashow" panel placed on a standard overhead projector. A video camera is set up to watch the workspace such that the surface of the projected image and the surface of the imaged area coincide. The transformation between the projected image and the observed image is a projection between two planes, and thus is easily described by an affine transformation (a sketch of estimating such a transformation from calibration marks is given below).

The workspace is populated by a manipulating "hand" and a number of physical and virtual objects which can be manipulated. Both physical and virtual devices act as tools whose manipulation is a communication channel between the user and the computer. The identity of the object which is manipulated carries a strong semiotic message. The manner in which the object is manipulated provides both semiotic and ergotic information.
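As a minimal sketch of the calibration step mentioned above, the affine transformation from projected-screen coordinates to camera-image coordinates can be recovered by least squares from a few point correspondences (for example, calibration marks projected at known screen positions and located in the camera image). The coordinates, function names and image sizes below are illustrative assumptions, not part of the original system.

```python
# Sketch: fitting the screen -> camera affine transform for a digital desk,
# assuming correspondences come from projected calibration marks located in
# the camera image.  All numeric values are hypothetical.

import numpy as np

def fit_affine(screen_pts, camera_pts):
    """Least-squares 2x3 affine transform mapping screen points to camera points.

    screen_pts, camera_pts: (N, 2) arrays of corresponding points, N >= 3.
    """
    screen_pts = np.asarray(screen_pts, dtype=float)
    camera_pts = np.asarray(camera_pts, dtype=float)
    n = len(screen_pts)
    # Each screen point (x, y) becomes a homogeneous row (x, y, 1).
    X = np.hstack([screen_pts, np.ones((n, 1))])
    # Solve X @ A.T ~= camera_pts in the least-squares sense.
    A_t, *_ = np.linalg.lstsq(X, camera_pts, rcond=None)
    return A_t.T  # 2x3 affine matrix

def screen_to_camera(A, point):
    """Map a single projected-screen point into camera-image coordinates."""
    x, y = point
    return A @ np.array([x, y, 1.0])

# Hypothetical calibration: four projected corners and their observed positions.
screen = [(0, 0), (640, 0), (640, 480), (0, 480)]
camera = [(52, 41), (585, 48), (590, 430), (48, 422)]
A = fit_affine(screen, camera)
print(screen_to_camera(A, (320, 240)))  # centre of the projected image
```

With more than three correspondences the least-squares fit averages out localization noise in the detected marks; the inverse mapping, from camera to screen coordinates, is obtained the same way with the point sets exchanged.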
Virtual objects are generated internally by the system and projected onto the workspace. Examples include cursors and other visual feedback symbols as well as shapes and words which may have meaning to the user. Virtual objects are easily created, but they lack the tactile (epistemic) feedback provided by physical objects. The vision system must be able to detect, track and recognize the user's hands as well as the tools which he manipulates. Tools may be virtual objects, or any convenient physical object which has been previously presented to the system. The system must also be able to extract meaning from the way in which tools are manipulated.
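To make the appearance-based tracking concrete, the following sketch follows a small grey-level template of the hand (or a fingertip) from frame to frame by normalized cross-correlation inside a search window around its previous position. Frame acquisition, the initial template and the search radius are assumptions for illustration; this is one simple instance of the appearance-based techniques discussed above, not the specific tracker used in the system.

```python
# Sketch: appearance-based template tracking by normalized cross-correlation.
# 'frame' and 'template' are 2-D arrays of grey levels; how they are acquired
# is outside the scope of this sketch.

import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation between two equal-sized grey-level patches."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

def track(frame, template, prev_rc, radius=16):
    """Search a window of +/- radius pixels around prev_rc for the best match.

    prev_rc: (row, col) of the template's top-left corner in the previous frame.
    Returns the new (row, col) and the correlation score of the best match.
    """
    th, tw = template.shape
    r0, c0 = prev_rc
    best_score, best_rc = -1.0, prev_rc
    for r in range(max(0, r0 - radius), min(frame.shape[0] - th, r0 + radius) + 1):
        for c in range(max(0, c0 - radius), min(frame.shape[1] - tw, c0 + radius) + 1):
            score = ncc(frame[r:r + th, c:c + tw], template)
            if score > best_score:
                best_score, best_rc = score, (r, c)
    return best_rc, best_score
```

Restricting the search to a small window around the previous position is what keeps such trackers fast enough for real-time use; a low correlation score signals that the hand has been lost and that detection must be re-run over the full workspace.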