Extracting attribute-value pairs from product specifications on the web

Comparison shopping portals integrate product offers from large numbers of e-shops in order to support consumers in their buying decisions. Product offers often consist of a title and a free-text product description, both of which describe the product attributes that the specific vendor considers relevant. In addition, product offers might contain structured or semi-structured product specifications in the form of HTML tables and HTML lists. As product specifications often cover more product attributes than free-text descriptions, extracting attribute-value pairs from these specifications is a critical prerequisite for achieving good results in tasks such as product matching, product categorisation, faceted product search, and product recommendation. In this paper, we present an approach for extracting attribute-value pairs from product specifications on the Web. We use supervised learning to classify the HTML tables and HTML lists within a web page as product specifications or not. To extract attribute-value pairs from the HTML fragments identified by the specification detector, we again use supervised learning, this time to classify columns as attribute columns or value columns. Compared to DEXTER, the current state-of-the-art approach for extracting attribute-value pairs from product specifications, we introduce several new features for specification detection and support the extraction of attribute-value pairs from specifications having more than two columns. This allows us to improve the F-score by up to 10% for extracting attribute-value pairs from tables and by up to 3% for lists. In addition, we report the results of using duplicate-based schema matching to align the product attribute schemata of 32 different e-shops. This experiment confirms the suitability of duplicate-based schema matching for product data integration.
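The two-step pipeline described above (detect specification fragments, then separate attribute columns from value columns) can be illustrated with a minimal Python sketch. It is not the approach from the paper: both supervised classifiers are replaced by simple hand-written heuristics, and the names `parse_rows`, `looks_like_specification`, `pick_attribute_column`, and `extract_pairs`, as well as the sample table, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell texts of every <tr> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text chunks of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and normalise whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def looks_like_specification(rows, min_rows=2):
    # Assumption: stand-in for the supervised specification detector.
    # Accept fragments with several rows of at least two cells each.
    return len(rows) >= min_rows and all(len(r) >= 2 for r in rows)


def pick_attribute_column(rows):
    # Assumption: stand-in for the supervised column classifier.
    # Attribute names rarely contain digits, so choose the column with
    # the fewest digit-bearing cells (leftmost column wins ties).
    width = min(len(r) for r in rows)

    def digit_cells(col):
        return sum(any(ch.isdigit() for ch in r[col]) for r in rows)

    return min(range(width), key=digit_cells)


def extract_pairs(html):
    rows = parse_rows(html)
    if not looks_like_specification(rows):
        return []
    attr_col = pick_attribute_column(rows)
    pairs = []
    for row in rows:
        name = row[attr_col].rstrip(":").strip()
        for i, cell in enumerate(row):
            # Every other non-empty cell in the row is treated as a value,
            # which also covers fragments with more than two columns.
            if i != attr_col and name and cell:
                pairs.append((name, cell))
    return pairs


sample = (
    "<table>"
    "<tr><td>Brand:</td><td>Acme</td></tr>"
    "<tr><td>Weight</td><td>1.2 kg</td></tr>"
    "<tr><td>Screen size</td><td>15 inch</td></tr>"
    "</table>"
)
pairs = extract_pairs(sample)
```

For the sample table, `pairs` holds one tuple per row, such as `("Weight", "1.2 kg")`. A learned classifier would replace both heuristics with models trained on labelled fragments and columns.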
