There is an increasing need for supportive technology for elderly people living independently in their own homes, as the percentage of elderly people grows. A crucial issue is resolving the conflicting goals of providing a safer, technology-assisted environment and maintaining the users’ privacy. We address the problem of recognizing ordinary household activities of daily living (ADLs) by exploring different sensing modalities: multi-view, computer-vision-based silhouette mosaics and radio-frequency identification (RFID)-based direct sensors. Multiple sites in our smart-home testbed are covered by synchronized cameras with different imaging resolutions. Training behavior models without costly manual labeling is achieved by using RFID sensing. Privacy is maintained by converting the raw image to a granular mosaic, while recognition accuracy is maintained by introducing a multi-view representation of the scene. Advantages of the proposed approach include robust segmentation of objects, view-independent tracking and representation of objects and persons in 3D space, efficient handling of occlusion, and the recognition of human activity without exposing the actual appearance of the inhabitants. Experimental evaluation shows that recognition accuracy using the multi-view silhouette mosaic representation is comparable with the baseline recognition accuracy using RFID-based sensors.

Introduction and Research Motivation

There is an increasing need for the development of supportive technology for elderly people living independently in their own homes, as the percentage of the elderly population grows. Computer-based recognition of activities of daily living (ADLs) has gained increasing interest from computer science and medical researchers, as the projected care-giving cost is expected to increase dramatically. We have built the Laboratory for Assisted Cognition Environments (LACE) to prototype human activity recognition systems that employ a variety of sensors. In this paper, we address the task of recognizing ADLs in a privacy-preserving manner in next-generation smart homes, and present our ongoing research on Assisted Cognition for daily living.

Our system uses multiple cameras and a wearable RFID reader. The cameras provide multi-scale, multi-view synchronized data, which enables robust visual recognition in the face of occlusion and of both large- and small-scale motions. A short-range, bracelet-form-factor RFID reader developed at Intel Research Seattle (the iBracelet) remotely transmits time-stamped RFID readings to the vision system’s computer. RFID tags are attached to various objects around the smart home, including furniture, appliances, and utensils. Although we currently use commercial-quality cameras and a high-end frame grabber to integrate the video feeds, the decreasing cost of video cameras and the increasing power of multicore personal computers will make it feasible in the near future to deploy the proposed system with inexpensive cameras and an ordinary desktop computer.

Previous approaches to recognizing ADLs have depended upon users wearing sensors (RFID and/or accelerometers) (Patterson et al. 2005), upon audio-visual signals (Oliver et al. 2002), or upon a single-camera vision system (Abowd et al. 2007; Mihailidis et al. 2007). Recently, (Wu et al. 2007) employed a combination of vision and RFID.
The system was able to learn object appearance models using RFID tag information instead of manual labeling. It is, however, limited to a single camera view, which makes the processing view-dependent. The system also did not attempt to model or learn the motion information involved in performing the ADLs.

We propose a multi-sensor activity recognition system that uses multiple cameras and RFID in a richer way. Understanding human activity can be approached at different levels of detail: for example, a body transitioning across a room at a coarse level, versus hand motions manipulating objects at a detailed level (Aggarwal and Park 2004). Our multi-camera vision system covers various indoor areas at different viewing resolutions and from different perspectives. RFID tags and reader(s) detect, without false positives, the nearby objects that are handled by the user. The advantages of such a synergistic integration of vision and RFID include robust segmentation of objects, view-independent tracking and representation of objects and persons in 3D space, efficient handling of occlusion, efficient learning of the temporal boundaries of activities without human intervention, and the recognition of human activity at both a coarse and a fine level.

System Architecture Overview

Figure 1: The overall system architecture of the multi-scale, multi-perspective vision system.

Fig. 1 shows the overall system architecture. The clear modules compose the basic single-view system, while the hashed modules provide the multi-view functionality. Activity analysis is performed on features transformed from the input modules. The planar homography module locates persons for activity analysis. The dashed enclosing boxes indicate that the silhouette and motion-map generation processes could be performed at the hardware level with infrared or near-infrared cameras to ensure privacy. In multi-view mode, the foreground image is redundantly captured to represent the three-dimensional extent of the foreground objects. Using multiple views not only increases robustness, but also supports simple and accurate estimation of view-invariant features such as object location and size.

Multiple View Scene Modeling

In contrast to single-camera systems, our multi-camera system provides view-independent recognition of ADLs. The vision system is composed of two wide field-of-view (FOV) cameras and two narrow-FOV cameras, all synchronized. The two wide-FOV cameras monitor the whole testbed and localize persons’ positions in 3D space based on a calibration-free homography mapping. The two narrow-FOV cameras focus on more detailed human activities of interest (e.g., cooking activities at the kitchen countertop area in our experiments). Currently, four cameras are used (Fig. 2) to capture the scene from different perspectives. The two wide-FOV cameras (in darker color) form approximately orthogonal viewing axes to capture the overall space (views 1 and 2 in Fig. 3), while the two narrow-FOV cameras (in lighter color) form approximately orthogonal viewing axes to capture more details of certain focus zones such as the kitchen (views 3 and 4 in Fig. 3). The four synchronized views are overlaid with a virtual grid to compute scene statistics such as pixel counts in each grid cell. Both track-level and body-level analysis can be used for activity analysis, depending upon the task.
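The ground-plane localization described above can be illustrated with a minimal sketch. It assumes a 3x3 planar homography H mapping a person’s foot point in one wide-FOV view to floor-plane coordinates; the matrix values and the function name to_floor_plane below are placeholders for illustration only, not the paper’s calibration-free procedure.

```python
import numpy as np

def to_floor_plane(H: np.ndarray, u: float, v: float) -> tuple:
    """Map an image point (u, v), e.g. a tracked person's foot location,
    to floor-plane coordinates using a 3x3 planar homography H."""
    p = H @ np.array([u, v, 1.0])      # projective transform in homogeneous coordinates
    return (p[0] / p[2], p[1] / p[2])  # normalize by the homogeneous scale

# Placeholder homography for illustration; in practice H would be estimated
# from correspondences between the camera view and the floor plane.
H = np.array([[0.010, 0.002, -3.0],
              [0.001, 0.015, -2.5],
              [0.000, 0.00005, 1.0]])

print(to_floor_plane(H, 320.0, 460.0))  # estimated floor coordinates of the foot point
```

With two approximately orthogonal wide-FOV views, each view can contribute such a mapping, and the resulting floor-plane estimates can be fused (e.g., averaged) to localize the person more robustly.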
In this paper, we focus on multi-view silhouette mosaic representations of detected foreground objects for privacy-preserving recognition of activities. In Fig. 1, dynamic contextual control with optional user involvement can be incorporated with the activity analysis and provides constraints to other processing modules as feedback. The top-down feedback flows in the system are marked as red arrows.

Figure 2: Multi-camera configuration. The two wide-FOV cameras (in darker color) and the two narrow-FOV cameras (in lighter color) each form approximately orthogonal viewing axes.

Figure 3: Distortion-compensated multi-view images 1 to 4 from the ADLs experiment in which a person performs the prepare-cereal activity. The virtual grid is overlaid to compute scene statistics.

Appearance-Based Segmentation and Tracking

ADLs may involve multiple objects moving simultaneously (Fig. 3), which creates challenges for a vision system such as changing backgrounds and object occlusion. We adopt a dynamic background model using K-means clustering (Kim et al. 2005). The background model is updated with a memory decay factor to adapt to changes in the background, and foreground-background segmentation is performed at each pixel. The silhouette representation of the foreground regions may obscure object appearance to preserve privacy (Fig. 4), but it also loses useful information about object identities. RFID reading provides a convenient way of identifying the object types. The onset and offset of a specific RFID label stream may provide a useful clue to the onset/offset of an activity that typically manipulates the corresponding objects.

Figure 4: Binary foreground maps corresponding to the multi-view images in Fig. 3. The whole image forms the super foreground map Γ^t. Black areas represent effective foreground regions (i.e., inverted for visualization only).

Representation of Scene Statistics

Figs. 3 and 4 show the process of representing scene statistics. We denote the m-th camera image and its foreground map at time t as I^t_m and F^t_m, respectively (m ∈ {1, 2, 3, 4}; see Fig. 3). A super image θ^t and its associated super foreground map Γ^t are obtained by juxtaposing the individual images I^t_1, ..., I^t_4 and F^t_1, ..., F^t_4, respectively. Therefore, if I^t_m is of size W × H pixels, θ^t and Γ^t are of size 2W × 2H. (In our implementation, the image width is W = 640 pixels and the image height is H = 480 pixels.) A virtual grid overlays the super foreground map Γ^t (Fig. 4) for decimation as follows. Each grid cell, of size S × S pixels (S = 40 pixels), counts the number of foreground pixels (Fig. 4) within its cell boundary and divides that count by the cell area:

δ^t_{i,j} = (1 / S²) Σ_{(x,y) ∈ C_{i,j}} Γ^t(x, y),

where C_{i,j} denotes the set of pixel locations in the (i, j)-th grid cell.
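As a concrete illustration of this decimation step, the following is a minimal numpy sketch under the stated sizes (four 640x480 binary foreground maps, cell size S = 40): the four maps are juxtaposed into a 2W x 2H super foreground map and the per-cell foreground ratio δ^t_{i,j} is computed. The function names and the random placeholder maps are illustrative, not the paper’s implementation.

```python
import numpy as np

W, H, S = 640, 480, 40  # image width, image height, grid-cell size (pixels)

def super_foreground_map(f1, f2, f3, f4):
    """Juxtapose four binary foreground maps (each H x W) into a 2H x 2W super map."""
    return np.vstack([np.hstack([f1, f2]),
                      np.hstack([f3, f4])])

def grid_features(gamma, s=S):
    """Compute delta[i, j]: the fraction of foreground pixels in each s x s grid cell."""
    rows, cols = gamma.shape[0] // s, gamma.shape[1] // s
    cells = gamma[:rows * s, :cols * s].reshape(rows, s, cols, s)
    return cells.mean(axis=(1, 3))  # mean of a binary map = foreground count / s^2

# Usage with random placeholder foreground maps (1 = foreground, 0 = background)
maps = [np.random.randint(0, 2, size=(H, W)) for _ in range(4)]
gamma = super_foreground_map(*maps)  # shape (2H, 2W) = (960, 1280)
delta = grid_features(gamma)         # shape (24, 32) grid of occupancy ratios
print(gamma.shape, delta.shape)
```

The resulting δ^t grid is a low-dimensional, appearance-free summary of where foreground mass is located across all four views at time t, which is the feature representation used for privacy-preserving activity analysis.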
References

[1] Khalid Sayood, et al. Introduction to Data Compression. 1996.
[2] Henry A. Kautz, et al. Fine-grained activity recognition by aggregating abstract object usage. In Ninth IEEE International Symposium on Wearable Computers (ISWC'05), 2005.
[3] James M. Rehg, et al. A Scalable Approach to Activity Recognition based on Object Use. In 2007 IEEE 11th International Conference on Computer Vision, 2007.
[4] Joan Truckenbrod. IEEE International Symposium on Wearable Computers. COMG, 1998.
[5] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
[6] Eric Horvitz, et al. Layered representations for human activity recognition. In Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces, 2002.
[7] Larry S. Davis, et al. Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 2005.
[8] Sangho Park, et al. Human motion: modeling and recognition of actions and interactions. 2004.