Interaction Between Modules in Learning Systems for Vision Applications

Complex vision tasks, such as event detection in surveillance video, can be divided into subtasks such as human detection, tracking, and trajectory analysis. A video can be thought of as being composed of features that are roughly arranged in a hierarchy from low-level to high-level. Low-level features include edges and blobs; high-level features include objects and events. Loosely, low-level feature extraction is based on signal and image processing techniques, while high-level feature extraction is based on machine learning techniques. Traditionally, vision systems extract features in a feedforward manner along this hierarchy: certain modules extract low-level features, and other modules use these to extract high-level features. Along with others in the research community, we have worked on this design approach. We briefly present our work on object recognition and multiperson tracking systems designed with this approach and highlight its advantages and shortcomings. Our focus, however, is on system design methods that allow tight feedback between the layers of the feature hierarchy, as well as among the high-level modules themselves. We present previous research on systems with feedback and discuss the strengths and limitations of these approaches. This analysis leads us to a new framework for designing complex vision systems, one that uses a graphical representation to allow tight feedback across a hierarchy of features and the modules that extract them. The framework is based on factor graphs. It relaxes some of the constraints of traditional factor graphs and replaces their function nodes with modified versions of modules that have been developed for specific vision tasks. These modules can be obtained by slightly modifying modules from other vision systems, provided that their input and output variables can be matched to variables in our graphical structure. The framework also draws inspiration from the product of experts and from the free energy view of the EM algorithm. We present experimental results and discuss the path for future development.
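
As a rough illustration of the kind of feedback a factor graph permits, the following minimal sketch runs the standard sum-product algorithm on a toy graph with one low-level variable `x` and one high-level variable `y`. It is not the framework described above: the variable names, the factor tables, and the helper functions are all hypothetical, and the factors are plain lookup tables rather than vision modules.

```python
import numpy as np

# Hypothetical toy factor graph with two binary variables:
#   x: a low-level feature (e.g. "blob present")
#   y: a high-level feature (e.g. "person present")
# All factor values below are made up for illustration.
f_x = np.array([0.6, 0.4])        # unary factor on x (low-level evidence)
f_y = np.array([0.2, 0.8])        # unary factor on y (high-level evidence)
f_xy = np.array([[0.9, 0.1],
                 [0.2, 0.8]])     # pairwise factor, indexed as f_xy[x, y]


def normalize(v):
    """Scale a non-negative vector so that it sums to one."""
    return v / v.sum()


def sum_product_marginals(f_x, f_y, f_xy):
    """Exact marginals on the chain x - f_xy - y via sum-product messages."""
    # Upward message into y: sum over x of f_x(x) * f_xy(x, y).
    msg_to_y = f_x @ f_xy
    # Downward (feedback) message into x: sum over y of f_xy(x, y) * f_y(y).
    msg_to_x = f_xy @ f_y
    p_x = normalize(f_x * msg_to_x)
    p_y = normalize(f_y * msg_to_y)
    return p_x, p_y


def brute_force_marginals(f_x, f_y, f_xy):
    """Reference marginals computed from the full joint table."""
    joint = f_x[:, None] * f_xy * f_y[None, :]
    joint = joint / joint.sum()
    return joint.sum(axis=1), joint.sum(axis=0)


p_x, p_y = sum_product_marginals(f_x, f_y, f_xy)
print("p(x):", p_x)
print("p(y):", p_y)
```

In this toy setting the downward message lets high-level evidence revise the belief about the low-level variable, which a purely feedforward pipeline would not do.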
