Automatic Product Classification Using Supervised Machine Learning Algorithms in Price Statistics

Modern approaches to computing consumer price indices draw on data sources such as web-scraped data and scanner data, which are very large in volume and require special processing techniques. In this paper, we address one of the main problems in consumer price index calculation, namely product classification, which cannot be performed manually on such large data sources. We therefore conducted an experiment on automatic product classification according to an international classification scheme. We combined 9 word-embedding techniques with 13 classification methods to identify the combination that yields the best classification quality. Because the dataset used in this experiment was significantly imbalanced, we compared the methods not only by accuracy, F1-score, and AUC, but also by a weighted F1-score that better reflects overall classification quality. Our experiment showed that logistic regression, support vector machines, and random forests, combined with the FastText skip-gram embedding technique, provided the best classification results, with higher values of the performance metrics than those reported in similar studies. An execution time analysis showed that, among these three methods, logistic regression was the fastest, while the random forest took longer to run. We also provide per-class performance metrics and an error analysis that enabled us to identify methods that can be excluded from consideration because they produced less reliable classifications for our purposes.
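To illustrate why several metrics are reported for an imbalanced dataset, the following minimal sketch contrasts accuracy, the unweighted (macro) F1-score, and a weighted F1-score on a toy example. It assumes the common definition in which each class's F1 is weighted by its share of the true labels; the exact weighting used in the experiment is not stated here, and the labels "bread" and "milk" and all function names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Share of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by true-label frequency."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    n = len(y_true)
    return sum(scores[label] * support[label] / n for label in scores)


# Hypothetical imbalanced sample: 8 "bread" items and 2 "milk" items,
# with one "milk" item misclassified as "bread".
y_true = ["bread"] * 8 + ["milk"] * 2
y_pred = ["bread"] * 8 + ["bread", "milk"]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

In this toy sample the three numbers differ because the rare class has a much lower F1 than the frequent one: accuracy and the weighted average are dominated by the frequent class, while the macro average gives both classes equal weight.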
