A DTD Extension for Document Structure Recognition

This paper deals with the representation of document models used in document recognition. It presents a novel formalism, the generalized n-gram, which is shown to be accurate for the recognition task and well suited to automatic learning from examples. The paper also addresses the thorny problem of integrating models for document analysis with existing standards for document manipulation and production.