Automatic web page segmentation and information extraction using conditional random fields

With the rapid development of the Internet, Web pages have become increasingly complex, and useful information is mixed with a large amount of redundant content. Most current Web information extraction systems rely on manual or semi-manual methods. To improve extraction efficiency, automatic methods for Web information extraction need further study. We first analyze the basic objects of a Web page according to the Function-based Object Model. We then present an automatic method that segments the Web page into semantic blocks using conditional random fields (CRFs). To further improve the quality of semantic block segmentation, we give an optimization algorithm for the semantic blocks that combines DOM structure with tree edit distance. Finally, we present an automatic Web information extraction tool and use it in experiments that evaluate the efficiency of information extraction. Compared with DOM-based Web information extraction systems, the experimental results show an increase in accuracy and recall.
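To illustrate how tree edit distance can compare candidate blocks in a DOM tree, here is a minimal Python sketch. The abstract does not say which tree edit distance variant the optimization step uses, so this sketch assumes the restricted top-down distance (roots are relabeled, and whole child subtrees are inserted or deleted). The `Node` class, the unit costs, and the `similarity` normalization are all hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A simplified DOM element: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def top_down_distance(a, b):
    """Restricted top-down tree edit distance (assumed variant, unit costs).

    Relabeling a root costs 1; inserting or deleting a child subtree
    costs the number of nodes in that subtree.
    """
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)

    # dp[i][j]: cost of turning the first i children of a into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # delete a subtree from a
                dp[i][j - 1] + size(b.children[j - 1]),   # insert a subtree from b
                dp[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),  # match the pair
            )
    return relabel + dp[m][n]


def similarity(a, b):
    """Distance scaled by total node count; 1.0 means identical trees (assumed normalization)."""
    return 1.0 - top_down_distance(a, b) / (size(a) + size(b))


# Two news-item-like blocks and one navigation-like block (made-up examples).
item1 = Node("div", [Node("a"), Node("span")])
item2 = Node("div", [Node("a"), Node("span"), Node("img")])
nav = Node("ul", [Node("li"), Node("li")])

print(top_down_distance(item1, item2), similarity(item1, item2))
print(top_down_distance(item1, nav), similarity(item1, nav))
```

Under these assumptions, sibling subtrees with high similarity would be candidates for merging into one semantic block, while structurally different siblings stay separate. The merge threshold would have to be chosen empirically.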
