Paper information - Automatic web page segmentation and information extraction using conditional random fields

With the rapid development of the Internet, Web pages have become increasingly complex, and useful information is mixed with a large amount of redundant content. Most current Web information extraction systems rely on manual or semi-manual methods. Improving extraction efficiency therefore requires further research into automatic methods. First, we analyze the basic objects of a Web page according to the Function-based Object Model (FOM). We then present an automatic method that segments the Web page into semantic blocks using conditional random fields (CRFs). To further improve the segmentation, we give an optimization algorithm for the semantic blocks that combines the DOM structure with tree edit distance. Finally, we present an automatic Web information extraction tool and use it in experiments that evaluate the efficiency of information extraction. The experimental results show higher accuracy and recall rate than DOM-based Web information extraction systems.

Qiang Liu | Yunfei Gong
