What are Textons?

Textons refer to fundamental micro-structures in natural images (and videos) and are considered the atoms of pre-attentive human visual perception (Julesz, 1981). Unfortunately, the word "texton" remains a vague concept in the literature for lack of a good mathematical model. In this article, we first present a three-level generative image model for learning textons from texture images. In this model, an image is a superposition of a number of image bases selected from an over-complete dictionary that includes Gabor and Laplacian-of-Gaussian functions at various locations, scales, and orientations. These image bases are, in turn, generated by a smaller number of texton elements selected from a dictionary of textons. By analogy to the waveform-phoneme-word hierarchy in speech, the pixel-base-texton hierarchy presents an increasingly abstract visual description and leads to dimension reduction and variable decoupling. By fitting the generative model to observed images, we can learn the texton dictionary as parameters of the generative model. The paper then studies the geometric, dynamic, and photometric structures of the texton representation by further extending the generative model to account for motion and illumination variations. (1) For the geometric structures, a texton consists of a number of image bases with deformable spatial configurations. The geometric structures are learned from static texture images. (2) For the dynamic structures, the motion of a texton is characterized by a Markov chain model in time, which can occasionally switch geometric configurations during the movement. We call moving textons "motons". The dynamic models are learned from the trajectories of the textons inferred from video sequences. (3) For the photometric structures, a texton represents the set of images of a 3D surface element under varying illuminations and is called a "lighton" in this paper. We adopt an illumination-cone representation in which a lighton is a texton triplet. For a given light source, a lighton image is generated as a linear sum of the three texton bases. We present a sequence of experiments for learning the geometric, dynamic, and photometric structures from images and videos, and we also present comparison studies with K-means clustering, sparse coding, independent component analysis, and transformed component analysis. Finally, we discuss how general textons can be learned from generic natural images.
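
As a rough illustration of the pixel-base-texton hierarchy and of the lighton linear sum described above, the following Python sketch synthesizes an image from textons and renders a lighton for one light source. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`gabor`, `expand_texton`, `synthesize`, `lighton_image`), the Gabor parameterization, and the tuple layouts are all hypothetical, only Gabor bases are used (no Laplacian-of-Gaussian), deformation of the spatial configuration is ignored, and no learning or inference is performed.

```python
import math


def gabor(size, sigma, theta, wavelength):
    """Return a size x size cosine Gabor patch as nested lists.

    Assumed parameterization for illustration; size should be odd.
    """
    half = size // 2
    patch = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the orientation of the sinusoid.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        patch.append(row)
    return patch


def expand_texton(texton, y, x):
    """Texton level -> base level.

    A texton is modeled here as a list of (patch, dy, dx, coeff) tuples,
    i.e. a few bases in a fixed relative spatial configuration.
    Placing it at (y, x) yields bases with absolute positions.
    """
    return [(patch, y + dy, x + dx, coeff) for patch, dy, dx, coeff in texton]


def synthesize(height, width, bases):
    """Base level -> pixel level: the image is a superposition of bases.

    Each base is (patch, top, left, coeff); patches must fit in the image.
    """
    image = [[0.0] * width for _ in range(height)]
    for patch, top, left, coeff in bases:
        for dy, row in enumerate(patch):
            for dx, value in enumerate(row):
                image[top + dy][left + dx] += coeff * value
    return image


def lighton_image(triplet, light):
    """Linear sum of three texton base images weighted by a light vector."""
    rows, cols = len(triplet[0]), len(triplet[0][0])
    return [
        [sum(light[k] * triplet[k][r][c] for k in range(3)) for c in range(cols)]
        for r in range(rows)
    ]


# Usage: one texton made of two non-overlapping Gabor bases, placed at (4, 4).
g_horizontal = gabor(5, 1.5, 0.0, 4.0)
g_vertical = gabor(5, 1.5, math.pi / 2.0, 4.0)
texton = [(g_horizontal, 0, 0, 2.0), (g_vertical, 0, 6, -1.0)]
image = synthesize(16, 16, expand_texton(texton, 4, 4))

# Usage: a lighton rendered under a hypothetical light vector.
triplet = [g_horizontal, g_vertical, gabor(5, 1.5, math.pi / 4.0, 4.0)]
shaded = lighton_image(triplet, (1.0, 0.5, 2.0))
```

Representing a texton as a short list of bases with relative offsets is what gives the dimension reduction mentioned above: the image is described by a handful of texton placements rather than by every base coefficient.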

[1]  David A. Forsyth,et al.  Shading primitives: finding folds and shallow grooves , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[2]  Jitendra Malik,et al.  Recognizing surfaces using three-dimensional textons , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[3]  D. Sagi,et al.  Where practice makes perfect in texture discrimination: evidence for primary visual cortex plasticity. , 1991, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Junyu Dong,et al.  Capture and Synthesis of 3D Surface Texture , 2005, International Journal of Computer Vision.

[5]  David J. Field,et al.  Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.

[6]  Terrence J. Sejnowski,et al.  An Information-Maximization Approach to Blind Separation and Blind Deconvolution , 1995, Neural Computation.

[7]  Song-Chun Zhu,et al.  Visual learning by integrating descriptive and generative methods , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[8]  Stéphane Mallat,et al.  A Theory for Multiresolution Signal Decomposition: The Wavelet Representation , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Song-Chun Zhu,et al.  Analysis and synthesis of textured motion: particles and waves , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Harry Shum,et al.  Synthesizing bidirectional texture functions for real-world surfaces , 2001, SIGGRAPH.

[11]  Song-Chun Zhu,et al.  What are Textons? , 2002, ECCV.

[12]  Edward H. Adelson,et al.  The perception of shading and reflectance , 1996 .

[13]  Song-Chun Zhu,et al.  Modeling Visual Patterns by Integrating Descriptive and Generative Methods , 2004, International Journal of Computer Vision.

[14]  A. Shashua Geometry and Photometry in 3D Visual Recognition , 1992 .

[15]  Joseph J. Atick,et al.  What Does the Retina Know about Natural Scenes? , 1992, Neural Computation.

[16]  Zhuowen Tu,et al.  Image Segmentation by Data-Driven Markov Chain Monte Carlo , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[17]  S. Ullman,et al.  Geometry and photometry in three-dimensional visual recognition , 1993 .

[18]  David W. Jacobs,et al.  Linear fitting with missing data: applications to structure-from-motion and to characterizing intensity images , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[19]  Song-Chun Zhu,et al.  Towards a mathematical theory of primal sketch and sketchability , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[20]  David J. Kriegman,et al.  What Is the Set of Images of an Object Under All Possible Illumination Conditions? , 1998, International Journal of Computer Vision.

[21]  J. Daugman Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. , 1985, Journal of the Optical Society of America. A, Optics and image science.

[22]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm , 1977 .

[23]  Harry Shum,et al.  Motion texture: a two-level statistical model for character motion synthesis , 2002, ACM Trans. Graph..

[24]  Edward H. Adelson,et al.  Shiftable multiscale transforms , 1992, IEEE Trans. Inf. Theory.

[25]  Donald Geman,et al.  Modeling natural microimage statistics , 2000 .

[26]  David J. Kriegman,et al.  The Bas-Relief Ambiguity , 2004, International Journal of Computer Vision.

[27]  Ronald R. Coifman,et al.  Entropy-based algorithms for best basis selection , 1992, IEEE Trans. Inf. Theory.

[28]  Eero P. Simoncelli,et al.  Image compression via joint statistical characterization in the wavelet domain , 1999, IEEE Trans. Image Process..

[29]  Song-Chun Zhu,et al.  Statistical Modeling and Conceptualization of Visual Patterns , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[30]  Song-Chun Zhu,et al.  Prior Learning and Gibbs Reaction-Diffusion , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[31]  F. Kong,et al.  A stochastic approximation algorithm with Markov chain Monte-carlo method for incomplete data estimation problems. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[32]  D. Hubel,et al.  Receptive fields, binocular interaction and functional architecture in the cat's visual cortex , 1962, The Journal of physiology.

[33]  Yanxi Liu,et al.  A computational model for repeated pattern perception using frieze and wallpaper groups , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[34]  Martin Vetterli,et al.  Data Compression and Harmonic Analysis , 1998, IEEE Trans. Inf. Theory.

[35]  B. Julesz Textons, the elements of texture perception, and their interactions , 1981, Nature.

[36]  Shree K. Nayar,et al.  Bidirectional Reflection Distribution Function of Thoroughly Pitted Surfaces , 1999, International Journal of Computer Vision.

[37]  Song-Chun Zhu,et al.  A Generative Method for Textured Motion: Analysis and Synthesis , 2002, ECCV.

[38]  Brendan J. Frey,et al.  Transformed component analysis: joint estimation of spatial transformations and image components , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.