Segmentation-Based Urban Traffic Scene Understanding

Recognizing the traffic scene in front of a car is an important asset for autonomous driving as well as for safety systems. While GPS-based maps abound and have reached a high level of accuracy, they can still profit from additional, image-based information. Especially in urban scenarios, GPS reception can be unreliable, or the map might not contain the latest detours due to construction, demonstrations, etc. Furthermore, such maps are static and cannot account for dynamic traffic agents such as cars or pedestrians. In this paper, we therefore propose an image-based system that recognizes both the road type (straight, left/right curve, crossing, ...) and a set of frequently encountered objects (car, pedestrian, pedestrian crossing). The obtained information could then be fused with existing maps and either assist the driver directly (e.g., a pedestrian crossing is ahead: slow down) or help improve object tracking (e.g., where are possible entrance points for pedestrians or cars?).
Starting from a video sequence obtained from a car driving through urban areas, we employ a two-stage architecture termed Segmentation-Based Urban Traffic Scene Understanding (SUTSU). The first stage builds an intermediate representation of the image based on a patch-wise image classification. The patch-wise segmentation is inspired by recent work [3, 4, 5] and assigns class probabilities to every 8×8 image patch. As a feature set, we use the coefficients of the Walsh-Hadamard transform (a decomposition of the image into square waves) and, if available, additional information from the depth map. These features are then used in a one-versus-all training using AdaBoost for feature selection, where we choose 13 texture classes that we found to be representative of typical urban scenes. This yields a meta representation of the scene that is more suitable for further processing (Fig. 1 (b,c)).
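
As a minimal sketch of the patch features described above, the following Python code computes unnormalized 2D Walsh-Hadamard coefficients for every non-overlapping 8×8 patch of a grayscale image. It is an illustration under stated assumptions, not the implementation used in this work: the function names (hadamard, wht2d, patch_features) are hypothetical, the coefficients are in natural (Sylvester) order rather than sequency order, and the optional depth-map features are omitted.

```python
def hadamard(n):
    """Return the n x n Hadamard matrix (n a power of two), Sylvester construction."""
    h = [[1]]
    while len(h) < n:
        # Double the size: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def wht2d(patch):
    """Unnormalized 2D Walsh-Hadamard transform H * P * H of a square patch.

    H is symmetric, so no explicit transpose is needed.
    """
    h = hadamard(len(patch))
    return matmul(matmul(h, patch), h)


def patch_features(image, size=8):
    """Return (row, col, flattened coefficients) for each full size x size patch.

    `image` is assumed to be a list of equally long rows of gray values.
    """
    features = []
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            patch = [row[c:c + size] for row in image[r:r + size]]
            coeffs = wht2d(patch)
            features.append((r, c, [v for row in coeffs for v in row]))
    return features
```

For a constant patch, only the first coefficient is nonzero, which matches the interpretation of the remaining coefficients as responses to square-wave patterns of increasing complexity. Each 64-dimensional coefficient vector could then serve as input to a per-class boosted classifier.
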
In recent publications, such a segmentation was used for a variety of purposes, such as improving object detection [1, 5], analyzing occlusion boundaries, or 3D reconstruction. In this paper, we investigate the use of a segmentation for urban scene analysis. We infer a second set of features from the segmentation's probability maps, analyzing repetitivity, curvature, and rough structure. This set is then again used with a one-versus-all training to infer both the type of road segment ahead and the additional presence of pedestrians, cars, or pedestrian crossings. A Hidden Markov Model is used to temporally smooth the result. SUTSU is tested on two challenging sequences, spanning over 50 minutes of video of driving through Zurich. The experiments show that while a state-of-the-art scene classifier [2] separates global classes, such as road types, similarly well, a manually crafted feature set based on a segmentation clearly outperforms it on object classes. Example images are shown in Fig. 2. The main contribution of this paper is the application of recent research efforts in scene categorization to vision "in the wild", driving through urban scenarios. We furthermore show the advantage of a segmentation-based approach over a global descriptor, as the intermediate representation can easily be adapted to other underlying image data (e.g., dusk, rain, ...) without having to change the high-level classifier.
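
The temporal smoothing step can be illustrated with a small sketch. The text above states only that a Hidden Markov Model is used, so the details here are assumptions: the code applies online forward filtering with a hypothetical "sticky" transition matrix (probability `stay` of keeping the same road type between frames) and treats per-frame classifier scores as observation likelihoods. The function names and the value 0.9 are illustrative and not taken from this work.

```python
def sticky_transitions(n_states, stay=0.9):
    """Transition matrix that keeps the current state with probability `stay`
    and spreads the remaining mass evenly over the other states (assumed model)."""
    off = (1.0 - stay) / (n_states - 1)
    return [[stay if i == j else off for j in range(n_states)]
            for i in range(n_states)]


def forward_smooth(frame_scores, transitions, prior=None):
    """Return the most probable state per frame under HMM forward filtering.

    frame_scores[t][j] is a non-negative classifier score for state j at
    frame t, used here as an (unnormalized) observation likelihood.
    """
    n = len(transitions)
    belief = prior or [1.0 / n] * n
    labels = []
    for scores in frame_scores:
        # Predict: propagate the previous belief through the transition model.
        predicted = [sum(belief[i] * transitions[i][j] for i in range(n))
                     for j in range(n)]
        # Update: weight the prediction by the current frame's scores.
        unnorm = [predicted[j] * scores[j] for j in range(n)]
        total = sum(unnorm)
        # Fall back to the prediction if all scores are zero.
        belief = [u / total for u in unnorm] if total > 0 else predicted
        labels.append(max(range(n), key=belief.__getitem__))
    return labels
```

With such a model, a single frame whose classifier output weakly favors a different class is overruled by the accumulated evidence from the preceding frames, while a sustained change in the scores still switches the label after a few frames. Offline alternatives such as Viterbi decoding or forward-backward smoothing would fit the same description.
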

[1] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proc. IEEE.

[2] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[3] Pierre Baylou, et al. New operators for optimized orientation estimation, 2001, Proceedings 2001 International Conference on Image Processing (Cat. No.01CH37205).

[4] Pietro Perona, et al. Object class recognition by unsupervised scale-invariant learning, 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings.

[5] Antonio Torralba, et al. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope, 2001, International Journal of Computer Vision.

[6] Pietro Perona, et al. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories, 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[7] Daniel P. Huttenlocher, et al. Efficient Belief Propagation for Early Vision, 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004.

[8] Paul A. Viola, et al. Boosting Image Retrieval, 2004, International Journal of Computer Vision.

[9] Trevor Darrell, et al. The pyramid match kernel: discriminative classification with sets of image features, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10] Alexei A. Efros, et al. Geometric context from a single image, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[11] Yacov Hel-Or, et al. Real-time pattern matching using projection kernels, 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Jianguo Zhang, et al. The PASCAL Visual Object Classes Challenge, 2006.

[13] Alexei A. Efros, et al. Putting Objects in Perspective, 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[14] Luc Van Gool, et al. The 2005 PASCAL Visual Object Classes Challenge, 2005, MLCW.

[15] Cordelia Schmid, et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[16] Dariu Gavrila, et al. Multi-cue Pedestrian Detection and Tracking from a Moving Vehicle, 2007, International Journal of Computer Vision.

[17] Mohan M. Trivedi, et al. Video-based lane estimation and tracking for driver assistance: survey, system, and evaluation, 2006, IEEE Transactions on Intelligent Transportation Systems.

[18] Li Fei-Fei, et al. Spatially coherent latent topic model for concurrent object segmentation and classification, 2007.

[19] Roberto Cipolla, et al. Segmentation and Recognition Using Structure from Motion Point Clouds, 2008, ECCV.

[20] Luc Van Gool, et al. A mobile vision system for robust multi-person tracking, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[21] Antonio Criminisi, et al. Object Class Segmentation using Random Forests, 2008, BMVC.

[22] Roberto Cipolla, et al. Semantic texton forests for image categorization and segmentation, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.