A Clustering-Based Combinatorial Approach to Unsupervised Matching of Product Titles

The constant growth of the e-commerce industry has made the problem of product retrieval particularly important. As more enterprises move their activities to the Web, the volume and diversity of product-related information increase quickly. These factors make it difficult for users to identify and compare the features of the products they want. Recent studies have shown that standard similarity metrics cannot effectively identify identical products, since similar titles often refer to different products and vice versa. Other studies employed external data sources (search engines) to enrich the titles; these solutions are rather impractical, mainly because fetching the external data is slow. In this paper we introduce UPM, an unsupervised algorithm for matching products by their titles. UPM is independent of any external sources: it analyzes the titles and extracts combinations of words from them. These combinations are evaluated according to several criteria, and the most appropriate one determines the cluster into which a product is classified. UPM is also parameter-free, avoids pairwise product comparisons, and includes a post-processing verification stage that corrects erroneous matches. The experimental evaluation of UPM demonstrated its superiority over state-of-the-art approaches in terms of both efficiency and effectiveness.
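
To make the combination-based idea concrete, the following is a minimal sketch and not UPM itself. It extracts word combinations from each title, scores them, and uses the best-scoring combination as the cluster key, so no two titles are ever compared directly. The function names, the `max_k` cap (added only to keep the enumeration small, whereas the abstract describes UPM as parameter-free), and the scoring rule (frequency times squared length) are all assumptions made for illustration; the abstract does not specify the actual criteria, and the verification stage is omitted.

```python
from collections import Counter
from itertools import combinations


def tokenize(title):
    # Lowercase, split on whitespace, and keep distinct tokens in sorted order
    # so that word order in the title does not affect the combinations.
    return sorted(set(title.lower().split()))


def candidate_combinations(tokens, max_k=3):
    # All k-combinations of the distinct tokens, for k = 1..max_k.
    # max_k is a hypothetical cap, not something taken from the paper.
    for k in range(1, min(max_k, len(tokens)) + 1):
        yield from combinations(tokens, k)


def cluster_titles(titles, max_k=3):
    # First pass: count in how many titles each combination occurs.
    freq = Counter()
    per_title = []
    for title in titles:
        combos = list(candidate_combinations(tokenize(title), max_k))
        per_title.append(combos)
        freq.update(combos)

    # Second pass: assign each title to its best-scoring combination.
    # Assumed score: frequency * length^2, which favors longer shared
    # combinations; ties are broken by the combination itself.
    clusters = {}
    for title, combos in zip(titles, per_title):
        if not combos:
            continue  # skip empty titles
        best = max(combos, key=lambda c: (freq[c] * len(c) ** 2, c))
        clusters.setdefault(best, []).append(title)
    return clusters


titles = [
    "Apple iPhone 7 32GB",
    "iPhone 7 Apple 32GB black",
    "Samsung Galaxy S8",
]
clusters = cluster_titles(titles)
```

In this toy input the two iPhone titles share a cluster key even though their word order differs and one carries an extra token, while the Samsung title forms its own cluster. The cost is two linear passes over the titles rather than a quadratic number of title-to-title comparisons.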
