Identification and Hierarchical Structuring of Titles in HTML Documents

In this paper, we describe a method for automatically identifying titles in Web pages. Although HTML provides dedicated tags for titles, they are not always used correctly and are sometimes absent altogether. We use visual cues supplied by Cascading Style Sheets (CSS), such as font size or colour, to recover the title hierarchy. The assumption is that the higher an element sits in the title hierarchy, the more visually prominent it is. We automatically built a CSS corpus by crawling the Web and used it to train a Hidden Markov Model that identifies titles and their hierarchy. Preliminary results give an F-measure of 0.70 for title structuring and 0.86 for title identification.
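
As a rough illustration of the idea, and not the implementation evaluated here, the following Python sketch labels a sequence of page elements with a small Hidden Markov Model decoded by the Viterbi algorithm. Everything in it is an assumption made for the example: the three states (`h1`, `h2`, `text`), the font-size thresholds in `bucket`, and the hand-picked probabilities in `START`, `TRANS` and `EMIT`. In the approach described above, such parameters would be learned from the crawled CSS corpus, and more visual cues than font size would be used.

```python
import math

# Hypothetical label set: two title levels plus body text.
STATES = ["h1", "h2", "text"]

# Hypothetical parameters, chosen by hand for illustration only.
START = {"h1": 0.6, "h2": 0.1, "text": 0.3}
TRANS = {
    "h1":   {"h1": 0.05, "h2": 0.45, "text": 0.50},
    "h2":   {"h1": 0.05, "h2": 0.05, "text": 0.90},
    "text": {"h1": 0.10, "h2": 0.30, "text": 0.60},
}
EMIT = {
    "h1":   {"large": 0.80, "medium": 0.15, "small": 0.05},
    "h2":   {"large": 0.15, "medium": 0.70, "small": 0.15},
    "text": {"large": 0.05, "medium": 0.15, "small": 0.80},
}


def bucket(font_size_px):
    """Map a computed CSS font size (in pixels) to a coarse visibility class.

    The thresholds are assumptions, not values taken from the paper.
    """
    if font_size_px >= 24:
        return "large"
    if font_size_px >= 17:
        return "medium"
    return "small"


def viterbi(observations):
    """Return the most likely state sequence for a non-empty observation list."""
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {
        s: (math.log(START[s]) + math.log(EMIT[s][observations[0]]), [s])
        for s in STATES
    }
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Pick the predecessor that maximises the score of reaching s.
            prev = max(STATES, key=lambda p: best[p][0] + math.log(TRANS[p][s]))
            score = (
                best[prev][0]
                + math.log(TRANS[prev][s])
                + math.log(EMIT[s][obs])
            )
            new_best[s] = (score, best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Font sizes of five consecutive elements on a hypothetical page.
sizes = [32, 14, 20, 14, 14]
labels = viterbi([bucket(s) for s in sizes])
print(labels)  # ['h1', 'text', 'h2', 'text', 'text']
```

With these toy parameters, the largest element is labelled as a top-level title and the medium-sized one as a second-level title, which mirrors the assumption that hierarchy level follows visual prominence.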