Representing image content is a key step in image annotation and retrieval, both of which have recently become active topics in computer vision. To represent image content more efficiently and accurately, grid-based methods such as bag-of-words have been proposed and have attracted growing attention. However, how to segment images and which grid scale works best remain open problems. In this paper, we segment images into grids at several different scales, extract low-level features from every grid cell, and construct visual words by clustering these features. By comparing and analyzing experimental results on Corel 5K, we study the influence of multi-scale segmentation on the representation of image content. Our findings help in choosing appropriate scales for different settings in which image content must be represented, and are relevant to image classification and retrieval.
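The pipeline described above (grid segmentation at several scales, low-level features per grid cell, visual words by clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the per-cell feature (channel-wise mean and standard deviation), the plain NumPy k-means, the vocabulary size of 5, the scales 2, 4 and 8, and the random arrays standing in for images. The function names `grid_features`, `kmeans` and `bow_histogram` are hypothetical.

```python
import numpy as np

def grid_features(image, cells_per_side):
    """Split an H x W x C image into cells_per_side**2 cells and return
    one feature row per cell: per-channel mean followed by per-channel std.
    (Assumed feature; the paper's low-level features may differ.)"""
    h, w, c = image.shape
    rows = np.linspace(0, h, cells_per_side + 1, dtype=int)
    cols = np.linspace(0, w, cells_per_side + 1, dtype=int)
    feats = []
    for r0, r1 in zip(rows[:-1], rows[1:]):
        for c0, c1 in zip(cols[:-1], cols[1:]):
            cell = image[r0:r1, c0:c1].reshape(-1, c)
            feats.append(np.concatenate([cell.mean(axis=0), cell.std(axis=0)]))
    return np.array(feats)

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centers (visual words)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(features, centers):
    """Assign each cell to its nearest visual word and return the
    normalized word-count histogram for one image."""
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Stand-in data: six random 32 x 32 RGB "images" (assumption, not Corel data).
rng = np.random.default_rng(0)
images = [rng.random((32, 32, 3)) for _ in range(6)]

# Build one vocabulary and one set of histograms per grid scale.
scales = (2, 4, 8)
histograms = {}
for s in scales:
    per_image = [grid_features(img, s) for img in images]
    vocab = kmeans(np.vstack(per_image), k=5)
    histograms[s] = np.array([bow_histogram(f, vocab) for f in per_image])
```

Each entry of `histograms` holds one bag-of-words vector per image for a given scale, so representations built at different scales can be fed to the same classifier or retrieval metric and compared directly.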