Automatic Website Summarization by Image Content: A Case Study with Logo and Trademark Images

Image-based abstraction (or summarization) of a Web site is the process of extracting its most characteristic (or important) images. The importance of an image within a Web site is measured by its frequency of occurrence, the characteristics of its content, and Web link information. As a case study, this work focuses on logo and trademark images, which are important characteristic signs of corporate Web sites or of the products presented there. The proposed method uses machine learning to distinguish logos and trademarks from images of other categories (e.g., landscapes, faces). Because the same logo or trademark may appear many times in various forms within the same Web site, duplicates are detected and only unique logo and trademark images are extracted. These images are then ranked by importance, taking frequency of occurrence, image content and Web link information into account. The most important logos and trademarks are finally selected to form the image-based summary of the Web site. Evaluation results of the method on real Web sites are also presented. The method has been implemented and integrated into a fully automated image-based summarization system which is accessible on the Web (www.intelligence.tuc.gr/websummarization).
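The pipeline described above (filter by category, collapse duplicates, rank, select) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The record fields `signature`, `logo_prob` and `link_score`, the 0.5 threshold, and the equal-weight linear scoring are all assumptions made for illustration; the abstract does not specify how the three criteria are computed or combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    url: str
    signature: str     # hypothetical key shared by duplicate images
    logo_prob: float   # hypothetical classifier output in [0, 1]
    link_score: float  # hypothetical link-based score in [0, 1]


def summarize(images, k=3, threshold=0.5, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return the URLs of the k highest-ranked unique logo images."""
    # Step 1: keep only images the (assumed) classifier labels as logos.
    logos = [im for im in images if im.logo_prob >= threshold]
    if not logos:
        return []

    # Step 2: collapse duplicates by grouping on the assumed signature.
    groups = defaultdict(list)
    for im in logos:
        groups[im.signature].append(im)

    # Step 3: score each unique image from frequency, content and links.
    max_freq = max(len(members) for members in groups.values())
    w_freq, w_content, w_link = weights
    ranked = []
    for members in groups.values():
        # Represent the group by its most logo-like member.
        rep = max(members, key=lambda im: im.logo_prob)
        freq = len(members) / max_freq
        link = max(im.link_score for im in members)
        score = w_freq * freq + w_content * rep.logo_prob + w_link * link
        ranked.append((score, rep.url))

    # Step 4: highest score first; ties broken by URL for determinism.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in ranked[:k]]


if __name__ == "__main__":
    site = [
        ImageRecord("a1.png", "A", 0.9, 0.8),
        ImageRecord("a2.png", "A", 0.7, 0.5),
        ImageRecord("b.png", "B", 0.6, 0.2),
        ImageRecord("photo.jpg", "C", 0.1, 0.9),
    ]
    print(summarize(site, k=2))
```

In this toy input, `a1.png` and `a2.png` share a signature and count as one logo that occurs twice, while `photo.jpg` is discarded at the classification step despite its high link score. A real system would derive the signature from image content rather than receive it as a given.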
