Abstract. This paper proposes an approach to the vectorization and representation of large document images. The approach is based on a modified run-length image representation and a line-by-line processing scheme in which only a limited number of image lines are stored in memory. Within this approach, fast one-pass algorithms are suggested for thinning and for converting a large thinned image to vector form. A hierarchical data structure is suggested for representing these images in vector form; it compactly stores all the needed information about connected components, segments, and feature points. The steps for obtaining this data structure are described. Defects that can occur in the vector representation are identified, and an algorithm for reducing them is given. Experimental results are also presented.