A rule-based system for document image segmentation

A rule-based system for automatically segmenting a document image into regions of text and nontext is presented. The initial stages of the system perform image enhancement functions such as adaptive thresholding, morphological processing, and skew detection and correction. The image segmentation process consists of smearing the original image via the run length smoothing algorithm, calculating the connected components locations and statistics, and filtering (segmenting) the image based on these statistics. The text regions can be converted (via an optical character reader) to a computer-searchable form, and the nontext regions can be extracted and preserved. The rule-based structure allows easy fine tuning of the algorithmic steps to produce robust rules, to incorporate additional tools (as they become available), and to handle special segmentation needs.<<ETX>>