A Data Type-Driven Property Alignment Framework for Product Duplicate Detection on the Web

Over the last decade, broadband access has become ubiquitous and connected devices have made constant online engagement possible. As a consequence, e-commerce has grown immensely. Despite the market opportunities this offers retailers and the ease with which customers can buy products through Web shops, the shift to digital retail has its drawbacks. In particular, product information is cluttered and hard to compare across Web shops, which calls for an automated method to restore homogeneity in product representations. This paper presents a product duplicate detection solution that exploits a data type-driven property alignment framework. In our experiment, the solution achieves a statistically significant improvement of the \(F_1\)-score from 47.91% to 78.13% compared to an existing state-of-the-art approach.
