Methods and equipment for generating and maintaining a web content extraction template
The invention provides methods and equipment for generating and maintaining a web content extraction template. The generating equipment comprises an input unit, a weight calculation unit, a maximum alignment relationship calculation unit, a combination unit, a determination unit and a selection unit, wherein the weight calculation unit is configured to calculate the weights of the nodes of each type in each input tree. The maintaining equipment comprises a similarity calculation unit, a statistic calculation unit, a statistic judgment unit and a recalculation unit, wherein the similarity calculation unit calculates a similarity sequence; the statistic calculation unit traverses the similarity sequence with a window of a predetermined size and calculates a statistic within the window; and the statistic judgment unit judges, according to the calculated statistic, whether the web content extraction template is still suited to the input web pages. With these methods and equipment, the web content extraction template can be generated automatically and efficiently, and when a change to the web pages invalidates the extraction template or reduces its accuracy, the template can be regenerated automatically and quickly.
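
The maintenance step described above (sliding a fixed-size window over a similarity sequence and judging the template from a statistic in each window) can be illustrated with a minimal sketch. The abstract does not say which statistic is used or how it is judged, so the sketch assumes the statistic is the window mean and that the template is considered unsuitable once any window mean falls below a threshold. The function names `window_means` and `template_still_fits`, and the default values of `window_size` and `threshold`, are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def window_means(similarities: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the mean of every full window of `window_size` consecutive values.

    Assumption: the "statistic" of the abstract is taken to be the mean.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window = deque(maxlen=window_size)
    total = 0.0
    for value in similarities:
        if len(window) == window_size:
            # The oldest value is about to be evicted by the append below.
            total -= window[0]
        window.append(value)
        total += value
        if len(window) == window_size:
            yield total / window_size


def template_still_fits(
    similarities: Iterable[float],
    window_size: int = 3,
    threshold: float = 0.6,
) -> bool:
    """Return True if every full window has a mean at or above `threshold`.

    Assumption: a single low window is enough to flag the template.
    If the sequence is shorter than one window, no window is evaluated
    and the template is treated as still fitting.
    """
    return all(mean >= threshold for mean in window_means(similarities, window_size))


if __name__ == "__main__":
    # Hypothetical similarity sequence: the page layout changes near the end.
    sequence = [0.9, 0.9, 0.2, 0.1]
    if not template_still_fits(sequence, window_size=2, threshold=0.6):
        print("template no longer fits; regeneration would be triggered")
```

In this sketch a `False` result corresponds to the point at which the recalculation unit would regenerate the template. Judging a window statistic rather than individual similarity values keeps a single atypical page from triggering regeneration.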