A CRF-based approach for web object extraction

This paper presents a method for extracting Web objects. First, the web page is partitioned into blocks, and Web object blocks are identified by computing the information entropy of the resulting blocks. A Conditional Random Field (CRF) is then used as the probabilistic model, and a series of feature templates is built according to the characteristics of the objects themselves. Feature functions are generated from these feature templates and the results of Chinese word segmentation. Model parameters are estimated with the limited-memory BFGS (L-BFGS) algorithm, and the property sequences of Web object blocks are labeled with the Viterbi algorithm. Experimental results show that the proposed method is effective for extracting scientific data.
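
The labeling step can be illustrated with a minimal sketch of Viterbi decoding for a linear-chain model. This is not the paper's implementation: the function name `viterbi`, the score tables `emission` and `transition`, and all numeric values are assumptions chosen for illustration. In a trained CRF, these additive scores would come from the weighted feature functions.

```python
def viterbi(emission, transition):
    """Return the highest-scoring label sequence for a linear-chain model.

    emission[t][k]   : score of label k at position t (hypothetical values)
    transition[j][k] : score of moving from label j to label k (hypothetical)
    Scores are additive, as in log-space CRF potentials.
    """
    n_labels = len(transition)
    score = list(emission[0])  # best score of any path ending in each label
    backptrs = []              # backptrs[t-1][k] = best previous label for k at t

    for t in range(1, len(emission)):
        new_score, ptr = [], []
        for k in range(n_labels):
            # Pick the previous label that maximizes the path score into k.
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transition[j][k])
            new_score.append(score[best_j] + transition[best_j][k] + emission[t][k])
            ptr.append(best_j)
        score = new_score
        backptrs.append(ptr)

    # Trace back from the best final label to recover the full sequence.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy example: 3 tokens, 2 labels (e.g. 0 = attribute name, 1 = attribute value).
emission = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
no_preference = [[0.0, 0.0], [0.0, 0.0]]
sticky = [[0.0, -5.0], [-5.0, 0.0]]  # penalizes switching labels

print(viterbi(emission, no_preference))
print(viterbi(emission, sticky))
```

With zero transition scores the decoder simply follows the per-position emission scores; with the `sticky` transitions, the penalty for switching labels outweighs the emission evidence and a single label is kept throughout.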