Neural networks for machine vision: learning three-dimensional object representations

This is a study of machine vision techniques for learning and recognizing three-dimensional (3D) objects. Starting with guidelines from neurobiology, psychophysics, neural network modeling, and engineering systems implementation, a heterogeneous system of modular neural networks is developed. The visual input consists of grey-scale video of complex objects moving before stationary backgrounds. The system generates attentional cues from the imagery that the viewing camera can use to fixate objects serially via saccadic motions, or to maintain an object at the center of the visual field via pursuit movements. The final outputs are separate (but unlabeled) signals for each object. All control and error signals are generated internally, allowing continuous training, even during use. The problem is divided into six stages: feature extraction; diffusion for part/object attentional focusing and positionally invariant processing; log-polar mapping and mapped-feature diffusion for scale and rotational invariance; 2D shape encoding for foreshortening invariance; categorization for learning and recognizing 2D-invariant views; and view transition detection for learning and recognizing 3D objects. An integral part of the system is the Diffusion-Enhancement Bilayer neural network model proposed in this thesis. The same diffusion mechanism is employed in several places to perform distinct feature-clustering and centroid-determination functions. The result of the early processing is a sequence of view categories, called aspects, which represent characteristic views of the objects. The aspects become the input to a new 3D object representation network, called the Aspect Network, introduced in this thesis. The Aspect Network architecture is constructed around adaptive axo-axo-dendritic synapses, and is based on Koenderink's concept of an Aspect Graph.
From a sequence of views, the Aspect Network learns the transitions between characteristic views, crystallizing a graph-like structure for each object from an initially amorphous network. Object recognition emerges by accumulating evidence over single and multiple views, which activate competing object hypotheses. The implementation on PIPE video-rate hardware and the Connection Machine supports conclusions about the fitness of the individual modules for their tasks, the plausibility of the proposed problem decomposition, and the feasibility of continuous-time neural dynamics in artificial vision systems.
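
As a rough illustration of learning view transitions and accumulating evidence for competing object hypotheses, the following Python sketch records, for each object, the aspects seen and the transitions between them, and then scores a new aspect sequence against every stored graph. It is not the Aspect Network described in this thesis: it uses plain sets rather than adaptive synapses or continuous-time dynamics, and it assumes object identifiers are supplied during learning, whereas the thesis system produces unlabeled outputs. The class name `AspectGraphSketch`, its methods, and the two evidence weights are all hypothetical.

```python
from collections import defaultdict


class AspectGraphSketch:
    """Toy stand-in for an aspect graph; not the thesis's Aspect Network."""

    def __init__(self, view_weight=1.0, transition_weight=2.0):
        # Hypothetical weights: a matched transition counts more than a matched view.
        self.view_weight = view_weight
        self.transition_weight = transition_weight
        self.aspects = defaultdict(set)      # object id -> aspects seen
        self.transitions = defaultdict(set)  # object id -> unordered aspect pairs

    def learn(self, obj, aspect_sequence):
        """Store the aspects and the aspect-to-aspect transitions of one object."""
        seq = list(aspect_sequence)
        self.aspects[obj].update(seq)
        for a, b in zip(seq, seq[1:]):
            if a != b:  # a repeated aspect is not a transition
                self.transitions[obj].add(frozenset((a, b)))

    def evidence(self, aspect_sequence):
        """Accumulate single-view and transition evidence for every known object."""
        seq = list(aspect_sequence)
        scores = {}
        for obj in self.aspects:
            views = sum(1 for a in seq if a in self.aspects[obj])
            moves = sum(
                1
                for a, b in zip(seq, seq[1:])
                if a != b and frozenset((a, b)) in self.transitions[obj]
            )
            scores[obj] = self.view_weight * views + self.transition_weight * moves
        return scores

    def recognize(self, aspect_sequence):
        """Return the object hypothesis with the most accumulated evidence."""
        scores = self.evidence(aspect_sequence)
        return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    net = AspectGraphSketch()
    net.learn("obj0", [1, 2, 3, 1])  # aspect labels are arbitrary integers
    net.learn("obj1", [1, 3, 4, 1])
    print(net.evidence([1]))     # aspect 1 is shared, so both hypotheses tie
    print(net.evidence([1, 2]))  # the 1-2 transition was seen only for obj0
    print(net.recognize([3, 4]))
```

In this sketch a single shared view leaves the two hypotheses tied, while an observed transition separates them, which is the intuition behind accumulating evidence over multiple views.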