Towards Classifying HTML-embedded Product Data Based On Machine Learning Approach

In this paper we explore machine learning approaches that use product titles and descriptions to classify footwear by brand. The data were collected from many different online stores. In particular, we built a pipeline that automatically classifies product brands from these data. The dataset is provided in JSON format and contains more than 40,000 rows. The categorization component was implemented using the K-Nearest Neighbour (K-NN) and Support Vector Machine (SVM) algorithms. The pipeline was evaluated using the classification report; in particular, the weighted-average precision reached 79.0% for SVM and 72.0% for K-NN.
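To make the K-NN step and the reported metric concrete, the following is a minimal sketch and not the pipeline described above. It assumes a plain bag-of-words representation with cosine similarity, which the abstract does not specify. The sample titles and the names `tokenize`, `cosine`, `knn_predict`, and `weighted_precision` are hypothetical and chosen for illustration only. The SVM component is omitted.

```python
import math
from collections import Counter

# Hypothetical toy records; the real dataset is JSON with 40,000+ rows.
SAMPLES = [
    ("nike air max running shoe", "Nike"),
    ("nike revolution running sneaker", "Nike"),
    ("adidas ultraboost running shoe", "Adidas"),
    ("adidas superstar leather sneaker", "Adidas"),
]


def tokenize(text):
    """Lowercase and split on whitespace (assumed preprocessing)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, text, k=3):
    """Majority vote among the k training items most similar to text."""
    query = Counter(tokenize(text))
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        precision = correct / predicted if predicted else 0.0
        total += precision * n
    return total / len(y_true)


# Vectorize the training titles once, then classify an unseen title.
TRAIN = [(Counter(tokenize(title)), brand) for title, brand in SAMPLES]
predicted_brand = knn_predict(TRAIN, "nike air zoom shoe", k=3)
```

Weighting each class's precision by its support means that frequent brands dominate the average, which matters when brand frequencies in the data are unbalanced.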
