Automatic Visual Concept Training Using Imperfect Cross-Modality Information

In this chapter, we present an autonomous learning scheme that builds visual semantic concept models from video sequences or from the results returned by Internet search engines, without any manual labeling. First, users specify the concepts whose models are to be learned. Example videos or images are then retrieved from large video databases by keyword search on the automatic speech recognition transcripts, or alternatively gathered through Internet search engines. We propose to model the search results as “Quasi-Positive Bags” in Multiple-Instance Learning (MIL), and we call this formulation generalized MIL (GMIL). In some scenarios, GMIL also has no “Negative Bags”. We propose an algorithm called “Bag K-Means” to find the maximum of Diverse Density (DD) in the absence of negative bags. Its cost function takes the form of K-Means with a special “Bag Distance”. We also introduce “Uncertain Labeling Density” (ULD), which describes the target density distribution of instances in the case of quasi-positive bags, and present a “Bag Fuzzy K-Means” algorithm to find the maximum of ULD. Using this generalized MIL with ULD framework, the model for a particular concept can then be learned with general supervised learning methods. Experiments show that our algorithm obtains correct models for the concepts of interest.
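To make the “Bag K-Means” idea more concrete, the following is a minimal sketch, not the chapter's actual implementation. It assumes that the “Bag Distance” from a candidate concept point to a bag is the squared Euclidean distance to the closest instance in that bag, and that a single concept point is sought by alternating between picking each bag's nearest instance and averaging those instances. The names `bag_distance` and `bag_k_means`, the initial point `t0`, and the stopping parameters are hypothetical choices made for illustration.

```python
import numpy as np

def bag_distance(t, bag):
    """Assumed bag distance: squared Euclidean distance from point t
    to the closest instance in the bag. Returns (distance, index)."""
    d2 = np.sum((np.asarray(bag, dtype=float) - t) ** 2, axis=1)
    j = int(np.argmin(d2))
    return float(d2[j]), j

def bag_k_means(bags, t0, max_iter=100, tol=1e-8):
    """Search for one concept point t that locally minimizes the sum of
    bag distances over all bags (no negative bags are used).

    bags: list of arrays, each of shape (n_instances, n_features)
    t0:   initial guess for the concept point (hypothetical parameter)
    """
    bags = [np.asarray(b, dtype=float) for b in bags]
    t = np.asarray(t0, dtype=float)
    for _ in range(max_iter):
        # Assignment step: each bag contributes only its nearest instance.
        nearest = np.array([b[bag_distance(t, b)[1]] for b in bags])
        # Update step: move t to the mean of the selected instances.
        t_new = nearest.mean(axis=0)
        converged = np.linalg.norm(t_new - t) < tol
        t = t_new
        if converged:
            break
    cost = sum(bag_distance(t, b)[0] for b in bags)
    return t, cost
```

As with ordinary K-Means, a procedure of this kind converges only to a local minimum, so in practice it would likely be restarted from several initial points (for example, from instances drawn from the bags) and the lowest-cost result kept.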
