Matching product titles using web-based enrichment

Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across the feeds have diverse representations with some missing important keywords like brand and others containing extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web earch engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the Cosine similarity between enriched title pairs with tokens weighted by their importance scores. We propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, we found that our matching algorithm has superior F1 scores compared to IDF-based cosine similarity.

[1]  P. Ivax,et al.  A THEORY FOR RECORD LINKAGE , 2004 .

[2]  Dongwon Lee,et al.  Search engine driven author disambiguation , 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[3]  Andreas Thor,et al.  Tailoring entity resolution for matching product offers , 2012, EDBT '12.

[4]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.

[5]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[6]  Lise Getoor,et al.  Collective entity resolution in relational data , 2007, TKDD.

[7]  Surajit Chaudhuri,et al.  A Primitive Operator for Similarity Joins in Data Cleaning , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[8]  Rodrigo Gonçalves,et al.  Approximate data instance matching: a survey , 2011, Knowledge and Information Systems.

[9]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[10]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[11]  Surajit Chaudhuri,et al.  Mining Document Collections to Facilitate Accurate Approximate Entity Matching , 2009, Proc. VLDB Endow..

[12]  Sugato Basu,et al.  Adaptive product normalization: using online learning for record linkage in comparison shopping , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[13]  Andreas Thor,et al.  Evaluation of entity resolution approaches on real-world match problems , 2010, Proc. VLDB Endow..

[14]  Peter Christen,et al.  A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication , 2012, IEEE Transactions on Knowledge and Data Engineering.

[15]  Ariel Fuxman,et al.  Matching unstructured product offers to structured product specifications , 2011, KDD.

[16]  Paul M. B. Vitányi,et al.  The Google Similarity Distance , 2004, IEEE Transactions on Knowledge and Data Engineering.

[17]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[18]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[19]  Chris Buckley,et al.  Improving automatic query expansion , 1998, SIGIR '98.

[20]  Surajit Chaudhuri,et al.  Exploiting web search to generate synonyms for entities , 2009, WWW '09.

[21]  Andrei Z. Broder,et al.  Robust classification of rare queries using web knowledge , 2007, SIGIR.

[22]  Jennifer Widom,et al.  Swoosh: a generic approach to entity resolution , 2008, The VLDB Journal.