Special issue on 3D representation for object and scene recognition

The ability to interpret the semantics of objects, their individual geometric attributes, and their spatial and functional relationships within complex environments is essential for an intelligent visual system. This capability is extremely valuable in numerous applications such as autonomous vehicle navigation, robot sensing and manipulation, mobile vision, image database indexing, and video surveillance. However, what appears natural to us can be tremendously difficult for an artificial system. Indeed, recognizing generic objects is one of the most challenging problems in visual recognition. Objects vary in appearance and shape because of intra-class variability as well as various photometric (e.g., illumination) and geometric (e.g., scale, viewpoint, occlusion) transformations. Largely because of this difficulty, most current research has focused on modeling object classes under the assumption that objects are observed from a limited set of viewpoints and, accordingly, has produced a vast assortment of learning techniques for discovering and classifying 2D visual patterns. But our world is not 2D: objects live in a physical, three-dimensional world. By modeling objects and their relations in 3D, we can provide robustness to changes in viewpoint and pose, along with contextual constraints that reflect the underlying structure of real-world scenes. Furthermore, a 3D representation enables methodologies for recovering an object's location, orientation, and distance from the observer (which we refer to as the pose of the object). Accurate pose recovery is not only crucial for interacting with objects in the environment but also a critical ingredient that enables a visual system to perform higher-level tasks such as object manipulation, activity recognition, or human–object interaction understanding.

To that end, a number of key questions must be addressed: How can we effectively learn 3D object representations from images or video? To what extent do we need explicit 3D models to deal with viewpoint variations? What level of supervision is required? This special issue features a series of works that try to answer these questions. The main purpose is to move beyond the currently popular paradigm in 3D object recognition, in which objects are represented as collections (or mixtures) of 2D single-view models. In these methods, different poses of the same object category result in completely independent models, where neither features nor parts are shared across views. The goal, rather, is to explore representations that capture the intrinsic multiview nature of objects and enable the recovery of basic geometric attributes of objects in relation to the observer and the environment.

The special issue begins with the work by Chiu et al., which explores the idea of constructing a weak three-dimensional model from as few as two views of an object of the target class. The authors use such a model to transform images of objects from one view to arbitrary other views. The approach can be used in conjunction with other state-of-the-art 2D image-based recognition systems and provides a critical tool for reducing the required amount of supervision. The remainder of the special issue collects a number of works based on the concept of representing objects as collections of elements (features, parts, contours) that are connected across views so as to form a unique and coherent model of the object.
In the work by Noceti et al., the authors propose a technique for recognizing single 3D objects in videos. They show that the spatial and temporal coherency of features in a video sequence can be successfully used to build object models that are robust to viewpoint variations.

In the work by Tamaki et al., objects are represented as a cyclic group to model the appearance change in an image sequence under rotation around an arbitrary axis. The focus is on recovering the viewpoint of single object instances.

In the work by Thomas et al., relative depth information available for a set of training images is transferred to new images of previously unseen object instances. This way, whenever an object has been recognized, the image is effectively augmented with depth estimates (or any other type of metadata, such as object parts).

The last paper of this special issue is presented by Lo and Siebert. In this work, the idea of extending a 2D object representation with depth information provided by range data is explored. The authors introduce an algorithm that extracts robust feature descriptors from 2.5D range images and models objects by constructing histograms of different range surface topology types. These histograms are then used for robust object matching.

We believe this special issue provides a great opportunity to bring together experts from multiple areas of computer vision, and we hope it may serve as a starting point for stimulating debate on different methodologies for representing, recognizing, and interpreting objects in the 3D world.

Guest Editors
Silvio Savarese
Department of Electrical and Computer Engineering, University of Michigan,
1301 Beal Ave., Room 4120, Ann Arbor, MI 48109-2122, USA
E-mail address: silvio@umich.edu