A Trained System for Multimodal Perception in Urban Environments

This paper presents a novel approach to detecting and tracking multiple classes of objects based on the combined information retrieved from a camera and a laser range scanner. Laser data points are classified using Conditional Random Fields (CRF) that use a set of features classified by multiclass AdaBoost. The image detection system is based on the Implicit Shape Model (ISM), which learns an appearance codebook of local descriptors from a set of hand-labeled images of pedestrians and uses them in a voting scheme to vote for the centers of detected people. We propose several extensions in the training phase, in order to automatically create subparts and probabilistic shape templates, and in the testing phase, in order to use this extended information to select and discriminate between hypotheses of different classes. Finally, the two sources of information are combined during tracking, which is based on Kalman filters with multiple motion models. Experiments conducted in real-world urban scenarios demonstrate the usefulness of our approach.

I. INTRODUCTION

Urban environments are complex scenes in which multiple objects often interact and move. In order to navigate and understand such an environment, a robot should be able to detect and track multiple classes of objects, most importantly pedestrians and cars. The ability to reliably detect these objects in real-world environments is crucial for a wide variety of applications, including video surveillance and intelligent driver assistance systems. Pedestrians are particularly difficult to detect because of their high variability in appearance due to clothing and illumination, and because their shape characteristics depend on the viewpoint. In addition, occlusions caused by carried items such as backpacks or briefcases, as well as clutter in crowded scenes, can render this task even more complex, because they dramatically change the shape of a pedestrian.
Cars are large objects whose shape changes dramatically with the viewpoint: for example, a side view of a car is totally different from its back view. Shape symmetries can easily create false detections, and shadows can mislead detection systems.

Our goal in this paper is to detect pedestrians and cars and localize them in 3D at any point in time. In particular, we want to provide a position and a motion estimate that can be used in a mobile robotic application. The real-time constraint makes this task particularly difficult and requires faster detection and tracking algorithms than the existing approaches. Our work makes a contribution in this direction.

The approach we propose is multimodal in the sense that we use laser range data and images from a camera cooperatively. This has the advantage that both geometrical structure and visual appearance information are available for more robust detection. Managing the detection of multiple classes in laser range data is a complex task due to the problem of data segmentation. Often, range data is grouped into consistent clusters and then classified using heuristic rules, which creates a strong prior in the algorithm. In this paper, we propose an elegant solution to train and classify range data using Conditional Random Fields (CRF) through the use of a boosted set of features. Moreover, each scan point is labeled with a probability of belonging to a certain class. In order to manage occlusions in complex visual scenarios, a new extension of the Implicit Shape Model (ISM) for camera data classification has been developed. Finally, each detected object is tracked using a greedy data association method and multiple Extended Kalman Filters that use different motion models. This way, the filter can cope with a variety of different motion patterns for several persons simultaneously.
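The greedy data association step mentioned above can be illustrated with a minimal sketch. The text does not specify the distance measure or the gating rule, so the Euclidean distance, the `gate` threshold, and the function name `greedy_associate` are assumptions made purely for illustration; the predicted track positions would in practice come from the Kalman filter prediction step.

```python
import math


def greedy_associate(tracks, detections, gate=1.0):
    """Greedily pair predicted track positions with new detections.

    tracks, detections: lists of (x, y) positions.
    gate: hypothetical maximum distance for a pairing to be accepted.
    Returns (pairs, unmatched_tracks, unmatched_detections), where pairs
    is a list of (track_index, detection_index) tuples.
    """
    # Collect every track/detection pair that falls inside the gate.
    candidates = []
    for ti, t in enumerate(tracks):
        for di, d in enumerate(detections):
            dist = math.dist(t, d)
            if dist <= gate:
                candidates.append((dist, ti, di))

    # Accept the closest pairs first; each track and each detection
    # may be used at most once.
    candidates.sort()
    used_tracks, used_detections, pairs = set(), set(), []
    for _, ti, di in candidates:
        if ti in used_tracks or di in used_detections:
            continue
        pairs.append((ti, di))
        used_tracks.add(ti)
        used_detections.add(di)

    unmatched_tracks = [i for i in range(len(tracks)) if i not in used_tracks]
    unmatched_detections = [
        i for i in range(len(detections)) if i not in used_detections
    ]
    return pairs, unmatched_tracks, unmatched_detections
```

In such a scheme, unmatched detections would typically start new tracks, and tracks that remain unmatched for several frames would be dropped; both policies are assumptions here, not details given in the text.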
In particular, the major contributions of this work are:
• An improved version of the image-based object detector by Leibe et al. (14). It consists of several extensions to the Implicit Shape Model (ISM) in the training step, in the detection step, and in the ability to cope with multiple classes. We introduce an automatic subpart extraction that is used to build an improved hypothesis selection, and the concept of superfeatures, which define a favorable feature selection that maintains information richness. Moreover, we introduce an automatically generated probability template map to ease the multiclass hypothesis selection.
• The combined use of Conditional Random Fields and camera detection to track objects in the scene.
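The voting idea behind an ISM-style detector can be sketched as follows. This is a simplified illustration under stated assumptions, not the detector described in this paper: votes are accumulated on a coarse grid rather than searched in a continuous space, and the codebook layout, the `cell` size, and the function names are hypothetical.

```python
from collections import defaultdict


def vote_for_centers(features, codebook, cell=10):
    """Accumulate object-center votes on a coarse grid.

    features: list of (x, y, entry_id), an image location matched to a
        codebook entry.
    codebook: dict mapping entry_id to a list of (dx, dy, weight), the
        offsets to the object center learned in training (hypothetical layout).
    cell: grid cell size in pixels (an assumption for this sketch).
    """
    votes = defaultdict(float)
    for x, y, entry_id in features:
        for dx, dy, weight in codebook.get(entry_id, []):
            cx, cy = x + dx, y + dy
            votes[(int(cx // cell), int(cy // cell))] += weight
    return votes


def best_hypothesis(votes, cell=10):
    """Return ((center_x, center_y), score) of the strongest cell, or None."""
    if not votes:
        return None
    (gx, gy), score = max(votes.items(), key=lambda kv: kv[1])
    return ((gx + 0.5) * cell, (gy + 0.5) * cell), score


# Toy example: three matched features that agree on one center.
codebook = {
    "head": [(0, 40, 1.0)],
    "foot": [(0, -40, 1.0)],
    "bag": [(15, 0, 0.5), (-15, 0, 0.5)],
}
features = [(100, 60, "head"), (102, 143, "foot"), (88, 101, "bag")]
center, score = best_hypothesis(vote_for_centers(features, codebook))
```

Here the head and foot features and one of the two bag offsets fall into the same grid cell, so that cell collects the highest score; the competing bag vote lands elsewhere and is outvoted.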

[1]  Luc Van Gool,et al.  Dynamic 3D Scene Analysis from a Moving Vehicle , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  R. B. Potts Some generalized order-disorder transformations , 1952, Mathematical Proceedings of the Cambridge Philosophical Society.

[3]  Hugh F. Durrant-Whyte,et al.  CRF-Matching: Conditional Random Fields for Feature-Based Scan Matching , 2007, Robotics: Science and Systems.

[4]  Bernt Schiele,et al.  Pedestrian detection in crowded scenes , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[5]  Roland Siegwart,et al.  Human detection using multimodal and multidimensional features , 2008, 2008 IEEE International Conference on Robotics and Automation.

[6]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[7]  Dariu Gavrila,et al.  The Visual Analysis of Human Movement: A Survey , 1999, Comput. Vis. Image Underst.

[8]  Daniel P. Huttenlocher,et al.  Efficient matching of pictorial structures , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[9]  Dirk Schulz,et al.  A Probabilistic Exemplar Approach to Combine Laser and Vision for Person Tracking , 2006, Robotics: Science and Systems.

[10]  Jorge Nocedal,et al.  On the limited memory BFGS method for large scale optimization , 1989, Math. Program.

[11]  Henrik I. Christensen,et al.  Tracking for following and passing persons , 2005, 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[12]  Roland Siegwart,et al.  Multimodal People Detection and Tracking in Crowded Scenes , 2008, AAAI.

[13]  António E. Ruano,et al.  Fast Line, Arc/Circle and Leg Detection from Laser Scan Data in a Player Driver , 2005, Proceedings of the 2005 IEEE International Conference on Robotics and Automation.

[14]  Paul A. Viola,et al.  Detecting Pedestrians Using Patterns of Motion and Appearance , 2005, International Journal of Computer Vision.

[15]  Gunilla Borgefors,et al.  Hierarchical Chamfer Matching: A Parametric Edge Matching Algorithm , 1988, IEEE Trans. Pattern Anal. Mach. Intell.

[16]  Cordelia Schmid,et al.  A performance evaluation of local descriptors , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Wolfram Burgard,et al.  Map building with mobile robots in dynamic environments , 2003, 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422).

[18]  Ben Taskar,et al.  Discriminative learning of Markov random fields for segmentation of 3D scan data , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[19]  David A. Forsyth,et al.  Probabilistic Methods for Finding People , 2001, International Journal of Computer Vision.

[20]  Ben J. A. Kröse,et al.  Part based people detection using 2D range data and images , 2007, 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[21]  Wolfram Burgard,et al.  Using Boosted Features for the Detection of People in 2D Range Data , 2007, Proceedings 2007 IEEE International Conference on Robotics and Automation.

[22]  Dariu Gavrila,et al.  Real-time object detection for "smart" vehicles , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[23]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[24]  Dieter Fox,et al.  Laser and Vision Based Outdoor Object Mapping , 2008, Robotics: Science and Systems.

[25]  Roland Siegwart,et al.  Multimodal detection and tracking of pedestrians in urban environments with explicit ground plane extraction , 2008, 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[26]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[27]  Wolfram Burgard,et al.  People Tracking with Mobile Robots Using Sample-Based Joint Probabilistic Data Association Filters , 2003, Int. J. Robotics Res.