Paper information - Layout Based Information Extraction from HTML Documents

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm detects the document layout; the extraction process then analyses the mutual positions of the detected blocks and their visual features. This approach is more robust than traditional DOM-based methods, and it opens new possibilities for specifying the extraction task.
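The idea of extracting data from the mutual positions and visual features of blocks can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes page segmentation has already produced rectangular blocks, and the `Block` fields, the `is_right_of` test, and the `find_value` helper are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A visual block assumed to come from a page segmentation step."""
    text: str
    x: float          # left edge, in pixels
    y: float          # top edge, in pixels
    width: float
    height: float
    font_size: float
    bold: bool = False


def is_right_of(label: Block, other: Block, tolerance: float = 5.0) -> bool:
    """True if `other` starts at or after the right edge of `label`
    (within `tolerance` pixels) and the two blocks overlap vertically."""
    starts_after = other.x >= label.x + label.width - tolerance
    overlap = (min(label.y + label.height, other.y + other.height)
               - max(label.y, other.y))
    return starts_after and overlap > 0


def find_value(blocks: List[Block], label_text: str) -> Optional[str]:
    """Return the text of the nearest block to the right of a bold label."""
    # Visual feature: only bold blocks are treated as labels (an assumption).
    label = next(
        (b for b in blocks
         if b.bold and b.text.strip().rstrip(":").lower() == label_text.lower()),
        None,
    )
    if label is None:
        return None
    # Mutual position: candidates lie on the same line, to the right.
    candidates = [b for b in blocks if b is not label and is_right_of(label, b)]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda b: b.x - (label.x + label.width))
    return nearest.text


blocks = [
    Block("Title", 10, 10, 200, 30, 24, bold=True),
    Block("Price:", 10, 100, 50, 20, 12, bold=True),
    Block("$25", 70, 100, 40, 20, 12),
    Block("In stock", 70, 130, 60, 20, 12),
]
print(find_value(blocks, "price"))
```

Because the rule refers only to rendered geometry and font weight, it does not depend on which HTML elements or nesting produced the blocks, which is the kind of robustness the abstract contrasts with DOM-based methods.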

Radek Burget
