On-Line Learning of Long-Range Obstacle Detection for Off-Road Robots

The method of choice for vision-based driving in off-road mobile robots is to construct a traversability map of the environment using stereo vision. In the most common approach, a stereo matching algorithm, applied to images from a pair of stereo cameras, produces a “point cloud” in which most visible pixels are given an XYZ position relative to the robot. A traversability map can then be derived using various heuristics, such as counting the number of points that lie above the ground plane in a given map cell. Maps from multiple frames are assembled into a global map in which path-finding algorithms are run [2, 3, 1]. The performance of such stereo-based methods is limited because stereo-based distance estimation is often unreliable beyond 8 or 10 meters (for typical camera configurations and resolutions). This can cause the system to drive as if in a self-imposed “fog”, driving into dead ends and taking a long time to discover distant pathways that are obvious to a human observer. We present two different solutions to the problem of long-range obstacle detection and path planning. Experiments were run on the LAGR robot platform, with the camera resolution set to 320x240.

Method 1: Computing a Polar Traversability Map from Stereo. The practical range of simple stereo-based map building is limited for two reasons: (1) it is difficult to estimate whether far-away points are near the ground or above the ground; (2) distance estimates are quite inaccurate for points more than 7 or 8 meters away from the camera. To solve problem 1, we estimate the parameters of the ground plane by fitting a plane through the stereo point cloud. Two methods were used: a Hough transform on point clouds in (elevation, azimuth, disparity) space, and EM-like robust plane fitting on point clouds in XYZ space. The traversability of an area is estimated by measuring the density of points that lie above the ground plane in that area (see the sketch below). Problem 2 is approached by noting that, while absolute range estimates of distant points are inaccurate, relative range estimates are reasonably accurate, and azimuth estimates are very accurate. This suggests that searching for a good direction in which to drive the robot is better performed on a map of the visible environment represented in polar coordinates, rather than on a Cartesian map of the entire ground. Our system identifies candidate waypoints up to 15 meters away in this local polar map and uses them as starting points for a path-finding algorithm in a global Cartesian map. The path-finding algorithm is a new approximate A*-like method based on ray casting, dubbed “raystar”. The system drives significantly faster than the LAGR baseline system on various test runs (videos will be shown).

Method 2: On-Line Learning from Distance-Normalized Monocular Images. Humans can easily locate pathways in monocular views, e.g. trails in a forest or holes in a row of bushes. Method 2 is an attempt to use on-line learning to give a mobile robot the same capability. Although supervised learning can be used for robot driving [4], autonomous learning is far preferable. One long-advocated idea for autonomous learning in robots is to use the output of reliable modules (such as traversability from stereo at short range) to provide labels for a trainable module (such as a long-range obstacle detector). In one spectacular demonstration of this idea, short-range traversability data was used to train a mixture-of-Gaussians model of the RGB color of traversable areas on-the-fly [5].
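Both methods build on this kind of short-range traversability derived from stereo. The following is a minimal sketch of that step, assuming the stereo point cloud is an N×3 array in a robot-centered XYZ frame (z up); the iterative reweighting loop stands in for the EM-like robust plane fit described under Method 1, and the cell size, height threshold, and inlier scale are illustrative values, not the parameters used on the robot.

```python
import numpy as np

def fit_ground_plane(points, n_iters=10, inlier_sigma=0.05):
    """Robust fit of a plane z = a*x + b*y + c to a stereo point cloud.

    points: (N, 3) array of XYZ coordinates in the robot frame (z up).
    Each iteration re-weights points by how close they lie to the current
    plane estimate, so obstacles above the ground barely influence the fit
    (an EM-like robust fit; the Hough-transform variant is not shown).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    A = np.column_stack([x, y, np.ones_like(x)])
    w = np.ones(len(points))                      # start with uniform weights
    for _ in range(n_iters):
        sw = np.sqrt(w)
        params, *_ = np.linalg.lstsq(A * sw[:, None], z * sw, rcond=None)
        resid = z - A @ params                    # height above the current plane
        w = np.exp(-0.5 * (resid / inlier_sigma) ** 2)   # down-weight outliers
    return params                                 # (a, b, c)

def traversability_counts(points, plane, cell=0.2, height_thresh=0.15, max_range=10.0):
    """Count above-ground points per ground cell; dense cells suggest obstacles."""
    a, b, c = plane
    height = points[:, 2] - (a * points[:, 0] + b * points[:, 1] + c)
    near = np.hypot(points[:, 0], points[:, 1]) < max_range
    above = points[(height > height_thresh) & near]
    n_bins = int(2 * max_range / cell)
    counts, _, _ = np.histogram2d(above[:, 0], above[:, 1], bins=n_bins,
                                  range=[[-max_range, max_range]] * 2)
    return counts                                 # high count => likely not traversable
```

Method 1 accumulates this kind of above-ground evidence in a local polar grid (azimuth by range) rather than the Cartesian grid shown here, which preserves the accurate azimuth estimates even where range is uncertain.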
Our proposed approach, designed for the LAGR robot, builds a distance-invariant pyramid of images at multiple scales, such that the appearance in the image at scale X of an object sitting on the ground X meters away is identical to the appearance in the image at scale Y of the same object sitting on the ground Y meters away. First, stereo data is used to label the traversability of visible areas up to 10 meters away. Then, at every frame, these labels are used to train a discriminative classifier that maps sub-windows of the image pyramid to the computed traversability at the corresponding location on the ground. The classifier is then applied to images in the pyramid from 10 meters to 30 meters (far beyond stereo range). To build the image pyramid, differently sized sub-images are cropped from the original RGB frame such that each is centered around a specific (imaginary) line on the ground at a given distance from the robot’s camera. Each extracted sub-image is then subsampled to a uniform height (12 pixels), yielding image bands in which the appearance of an object on the ground is independent of its distance from the camera (only the band in which it appears varies, see figure 1). These uniform-height, variable-width bands form a size-normalized image pyramid whose 20 scales are separated by a factor of 2^(1/4).
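The band-extraction geometry behind this pyramid can be sketched as follows, assuming an idealized pinhole camera at a known height with its optical axis parallel to a flat ground plane; the focal length, camera height, nearest band distance, 2^(1/4) spacing, metric band extent, and the use of OpenCV for resampling are all placeholder assumptions for illustration, not the LAGR calibration.

```python
import numpy as np
import cv2  # OpenCV, assumed available; used only to resample the crops

def ground_line_row(f_px, cam_height, cy, dist):
    """Image row of the (imaginary) ground line at distance `dist`, for a
    pinhole camera at height `cam_height` looking parallel to flat ground."""
    return cy + f_px * cam_height / dist          # image rows increase downward

def extract_bands(frame, f_px=400.0, cam_height=1.0, cy=120.0,
                  d_min=2.0, n_scales=20, ratio=2 ** 0.25,
                  band_m=1.0, band_px=12):
    """Distance-normalized pyramid: one band per distance, geometric spacing.

    Each band is cropped around the ground line at one of `n_scales`
    distances and covers roughly `band_m` meters above and below that line,
    then is resampled to a uniform `band_px` rows, so an object on the
    ground looks the same size in whichever band it falls into.
    """
    h, w = frame.shape[:2]
    bands = []
    for i in range(n_scales):
        d = d_min * ratio ** i
        v = ground_line_row(f_px, cam_height, cy, d)
        half = f_px * band_m / d                  # pixels spanned by band_m meters at distance d
        top = max(0, int(round(v - half)))        # clip the crop to the frame
        bot = min(h, int(round(v + half)))
        if bot - top < 2:                         # ground line falls outside the image
            continue
        crop = frame[top:bot, :]
        scale = band_px / crop.shape[0]           # same factor for height and width
        band = cv2.resize(crop, (max(1, int(w * scale)), band_px),
                          interpolation=cv2.INTER_AREA)
        bands.append((d, band))
    return bands                                  # uniform-height, variable-width bands
```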
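The per-frame training loop can likewise be sketched with a plain logistic-regression model standing in for the discriminative classifier; the feature vectors are assumed to be extracted elsewhere from the pyramid sub-windows, and the function names and learning rate are hypothetical.

```python
import numpy as np

class OnlineTraversabilityClassifier:
    """Logistic regression trained on-line, one gradient step per frame."""

    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, X):
        """P(traversable) for each row of the feature matrix X."""
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def update(self, X, y):
        """One logistic-regression gradient step on stereo-labeled windows."""
        p = self.predict(X)
        self.w -= self.lr * (X.T @ (p - y)) / len(y)
        self.b -= self.lr * float(np.mean(p - y))

def process_frame(clf, near_feats, near_labels, far_feats):
    """Train on short-range windows labeled by stereo, then classify the
    long-range windows (10-30 m) that stereo cannot label."""
    if len(near_labels):
        clf.update(near_feats, near_labels)       # labels come from the stereo map
    return clf.predict(far_feats)                 # traversability beyond stereo range
```

At every frame, `process_frame` would receive features and stereo-derived labels for sub-windows whose ground location is within 10 meters, and features only for the more distant sub-windows, so the classifier adapts continuously to the terrain the robot is currently seeing.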