Stochastic models for document restructuration

Document (re)structuration consists of mapping documents that come from different sources and in different formats onto a predefined semi-structured format. This generic problem appears in several application settings, such as querying heterogeneous semi-structured databases, peer-to-peer systems, legacy document conversion, and XML information retrieval. In this paper, we define the restructuration problem from a document-centric perspective and identify the main challenges it raises. We then consider two instances of restructuration: structuring flat documents and learning the correspondence between structured formats. We propose stochastic models for these two tasks and describe tests on a large XML document collection.