A Best Match KNN-based Approach for Large-scale Product Categorization

We combine the classic K-Nearest Neighbors (KNN) classification model with the Best Match 25 (BM25) probabilistic information retrieval model to assess how well classic KNN can be adapted to a real-life product categorization problem. This paper gives a system description of our KNN-based algorithm for product classification. Our submissions are based on the Rakuten 1M product listings dataset, in TSV format, provided by the Rakuten Institute of Technology Boston. The KNN classifier relies on product-title similarity scores generated by the BM25 information retrieval model. With k=3, our system achieved weighted precision, recall, and F1 scores of 0.7809, 0.7821, and 0.7790, respectively, on the test dataset.
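To make the idea concrete, the following is a minimal sketch of a BM25-scored KNN classifier over product titles. It is not the authors' implementation: the class name `BM25KNN`, the whitespace tokenizer, the parameter values `k1=1.2` and `b=0.75`, the non-negative IDF variant, the majority vote with a score-sum tie-break, and the toy titles and category labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(title):
    # Assumption: lowercase whitespace tokenization; the real system may differ.
    return title.lower().split()


class BM25KNN:
    """Hypothetical sketch: rank training titles by BM25, vote among the top k."""

    def __init__(self, titles, labels, k=3, k1=1.2, b=0.75):
        self.labels = list(labels)
        self.k, self.k1, self.b = k, k1, b
        docs = [Counter(tokenize(t)) for t in titles]
        self.doc_len = [sum(d.values()) for d in docs]
        self.avg_len = sum(self.doc_len) / len(docs)
        # Inverted index: term -> list of (doc_id, term_frequency).
        self.index = defaultdict(list)
        for doc_id, counts in enumerate(docs):
            for term, tf in counts.items():
                self.index[term].append((doc_id, tf))
        n = len(docs)
        # Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)), always positive.
        self.idf = {
            term: math.log(1 + (n - len(post) + 0.5) / (len(post) + 0.5))
            for term, post in self.index.items()
        }

    def scores(self, title):
        # BM25 score of every training title sharing at least one query term.
        acc = defaultdict(float)
        for term in set(tokenize(title)):
            for doc_id, tf in self.index.get(term, ()):
                norm = self.k1 * (
                    1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                )
                acc[doc_id] += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return acc

    def predict(self, title):
        acc = self.scores(title)
        top = sorted(acc.items(), key=lambda kv: kv[1], reverse=True)[: self.k]
        if not top:
            return None  # no training title shares a term with the query
        votes = defaultdict(lambda: [0, 0.0])
        for doc_id, score in top:
            votes[self.labels[doc_id]][0] += 1      # neighbor count
            votes[self.labels[doc_id]][1] += score  # tie-break: summed BM25 score
        return max(votes.items(), key=lambda kv: (kv[1][0], kv[1][1]))[0]


# Toy data, invented for illustration only.
train_titles = [
    "apple iphone 7 case black",
    "samsung galaxy phone case clear",
    "leather phone case wallet",
    "mens running shoes blue",
    "womens running shoes pink",
    "trail running shoes waterproof",
]
train_labels = ["phones", "phones", "phones", "shoes", "shoes", "shoes"]

clf = BM25KNN(train_titles, train_labels, k=3)
print(clf.predict("silicone phone case"))  # phones
print(clf.predict("kids running shoes"))   # shoes
```

The inverted index keeps scoring proportional to the number of training titles that share a term with the query, which matters at the scale of a million listings; a full scan over all titles would compute the same ranking more slowly.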
