A language for document generic layout description and its use for segmentation into regions

We present a segmentation method guided by a generic layout description expressed in a new language. The proposed language describes a page as superposed layers, which can be used to separate the main text body from other components such as figures. Its novelty is that, instead of directly describing the global topology of generic pages in terms of their regions, it describes generic separators that serve as region boundaries. Separators may be declared as white spaces or threads. Document segmentation into regions thus becomes a problem of separator determination, which is solved by analyzing the lines and white spaces contained in documents.
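
To illustrate the idea that locating separators is enough to delimit regions, the following minimal sketch finds horizontal white-space separators in a binarized page and derives the regions lying between them. It is not the proposed language or the method presented here: the page representation (rows of 0/1 pixels), the function names `white_separators` and `regions_between`, and the `min_height` threshold are all assumptions made for this example, and only white-space separators along one axis are handled.

```python
from typing import List, Tuple

Page = List[List[int]]  # assumed representation: 1 = ink, 0 = background
Span = Tuple[int, int]  # (start_row, end_row_exclusive)


def white_separators(page: Page, min_height: int = 2) -> List[Span]:
    """Return runs of fully blank rows that are at least min_height tall."""
    separators: List[Span] = []
    run_start = None
    for y, row in enumerate(page):
        blank = not any(row)
        if blank and run_start is None:
            run_start = y  # a blank run begins
        elif not blank and run_start is not None:
            if y - run_start >= min_height:
                separators.append((run_start, y))
            run_start = None  # the blank run ends
    # Close a blank run that reaches the bottom of the page.
    if run_start is not None and len(page) - run_start >= min_height:
        separators.append((run_start, len(page)))
    return separators


def regions_between(page: Page, separators: List[Span]) -> List[Span]:
    """Return the row bands delimited by the given separators."""
    regions: List[Span] = []
    cursor = 0
    for start, end in separators:
        if start > cursor:
            regions.append((cursor, start))
        cursor = end
    if cursor < len(page):
        regions.append((cursor, len(page)))
    return regions
```

In this sketch the regions are never described directly: they fall out of the separator positions, so a one-row gap shorter than `min_height` stays inside its region instead of splitting it.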