Simultaneous Product Attribute Name and Value Extraction with Adaptively Learnt Templates

Representing products as attribute name and value pairs improves the effectiveness of many applications. In this paper, we propose an adaptive template-based method that simultaneously extracts product attribute names and values from Web pages. The titles of Web pages assist the unsupervised template construction, and a template ranking strategy ensures that the correct templates are selected for every Web page. Our approach has four key steps: 1) construct a domain attribute word bag from the titles of Web pages; 2) segment text nodes using a set of default delimiters; 3) collect candidate attribute name and value pairs; and 4) learn high-quality templates with a template ranking algorithm. The experimental corpus is collected from two domains: digital cameras and mobile phones. Experiments show that our method achieves a precision of 94.68% and a recall of 90.57%.
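The four steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the delimiters, the template representation, or the ranking score, so everything here is an assumption made for illustration. In particular, a template is assumed to be a (tag path, delimiter) pair, the delimiter set is assumed to be the ASCII and full-width colons, and templates are ranked by how many of their candidate pairs share a word with the title-derived word bag. All function names are hypothetical.

```python
import re
from collections import Counter

# Assumed default delimiters; the abstract does not list the actual ones.
DELIMITERS = (":", "\uff1a")


def words(text):
    """Lowercased alphabetic tokens of at least three letters."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def build_word_bag(titles, min_count=2):
    """Step 1: keep words that occur in at least `min_count` page titles."""
    counts = Counter(w for title in titles for w in words(title))
    return {w for w, c in counts.items() if c >= min_count}


def segment(text):
    """Step 2: split a text node at the first default delimiter found."""
    for delim in DELIMITERS:
        if delim in text:
            name, value = (part.strip() for part in text.split(delim, 1))
            if name and value:
                return name, delim, value
    return None


def collect_candidates(nodes):
    """Step 3: turn (tag_path, text) nodes into (template, name, value)."""
    candidates = []
    for tag_path, text in nodes:
        parts = segment(text)
        if parts is not None:
            name, delim, value = parts
            candidates.append(((tag_path, delim), name, value))
    return candidates


def rank_templates(candidates, word_bag):
    """Step 4 (assumed score): order templates by word-bag hits, then hit ratio."""
    hits, totals = Counter(), Counter()
    for template, name, value in candidates:
        totals[template] += 1
        if words(name + " " + value) & word_bag:
            hits[template] += 1
    return sorted(totals, key=lambda t: (hits[t], hits[t] / totals[t]), reverse=True)


def extract_pairs(titles, nodes, top_k=1):
    """Run all four steps and keep pairs produced by the top-ranked templates."""
    word_bag = build_word_bag(titles)
    candidates = collect_candidates(nodes)
    best = set(rank_templates(candidates, word_bag)[:top_k])
    return [(name, value) for template, name, value in candidates if template in best]
```

In this sketch, `extract_pairs` takes the page titles of a domain and a list of `(tag_path, text)` pairs for one page. A text node such as "Posted: yesterday" still yields a candidate pair, but its template ranks low because none of its words appear in the title-derived word bag.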
