Detecting image purpose in World Wide Web documents

The number of WWW documents available to users of the Internet is growing rapidly, so it is becoming increasingly important to develop systems that help users search, filter, and retrieve information from the Internet. Currently, only a few prototype systems catalog and index images in Web documents. To improve the cataloging and indexing of images on the Web, we have developed a prototype rule-based system that detects content images in Web documents. Content images are images associated with the main content of a Web document, as opposed to the many other images that appear in Web documents for different purposes, such as decorative images, advertisements, and logos. The rules for the content image detection system are induced automatically using decision tree learning. The system combines visual features, text-related features, and the document context of each image for fast and effective content image detection in Web documents. We evaluated the system on more than 1200 images collected from four different Web sites and achieved an overall classification accuracy of 84 percent.
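To illustrate how decision tree learning can induce rules for this kind of classification, here is a minimal sketch. It is not the system described above. The three boolean features (`is_large` as a visual feature, `has_alt_text` as a text-related feature, `in_main_body` as document context), the eight training examples, and all function names are hypothetical and chosen only for illustration. The learner is a plain ID3-style procedure that splits on information gain.

```python
import math
from collections import Counter

# Hypothetical hand-made examples (NOT data from the paper).
# Each entry: (feature dict, label), label is "content" or "other".
EXAMPLES = [
    ({"is_large": True,  "has_alt_text": True,  "in_main_body": True},  "content"),
    ({"is_large": True,  "has_alt_text": False, "in_main_body": True},  "content"),
    ({"is_large": False, "has_alt_text": True,  "in_main_body": True},  "content"),
    ({"is_large": False, "has_alt_text": False, "in_main_body": True},  "other"),
    ({"is_large": True,  "has_alt_text": True,  "in_main_body": False}, "other"),
    ({"is_large": False, "has_alt_text": False, "in_main_body": False}, "other"),
    ({"is_large": True,  "has_alt_text": False, "in_main_body": False}, "other"),
    ({"is_large": False, "has_alt_text": True,  "in_main_body": False}, "other"),
]
FEATURES = ["is_large", "has_alt_text", "in_main_body"]

def entropy(examples):
    """Shannon entropy (in bits) of the label distribution."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, feature):
    """Entropy reduction obtained by splitting on a boolean feature."""
    remainder = 0.0
    for value in (True, False):
        subset = [e for e in examples if e[0][feature] == value]
        if subset:
            remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def build_tree(examples, features):
    """ID3-style induction: a leaf is a label, a node is a dict."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: information_gain(examples, f))
    rest = [f for f in features if f != best]
    node = {"feature": best, "branches": {}}
    for value in (True, False):
        subset = [e for e in examples if e[0][best] == value]
        node["branches"][value] = build_tree(subset, rest) if subset else majority
    return node

def classify(tree, image):
    """Walk the tree from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][image[tree["feature"]]]
    return tree

def extract_rules(tree, conditions=()):
    """Flatten the tree into (conditions, label) rules, one per leaf."""
    if not isinstance(tree, dict):
        return [(conditions, tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += extract_rules(subtree, conditions + ((tree["feature"], value),))
    return rules

TREE = build_tree(EXAMPLES, FEATURES)
RULES = extract_rules(TREE)

if __name__ == "__main__":
    for conditions, label in RULES:
        tests = " and ".join(f"{name}={value}" for name, value in conditions)
        print(f"IF {tests} THEN {label}")
```

Each root-to-leaf path becomes one IF-THEN rule, which is how a learned tree can be turned into a rule base. A real system would use many more features, continuous thresholds, and a held-out test set to estimate accuracy.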
