Concept-oriented labelling of patent images based on Random Forests and proximity-driven generation of synthetic data

Patent images help patent examiners understand the content of an invention, so automatic labelling of patent images is needed to support patent search tasks. Towards this goal, recent research proposes classification-based approaches to patent image annotation. A main drawback of these methods, however, is that they rely on large annotated patent image datasets, which require substantial manual effort to obtain. In this context, the proposed work extracts concepts from patent images using a supervised machine learning framework trained with limited annotated data and automatically generated synthetic data. Classification is performed with Random Forests (RF) and a combination of visual and textual features. First, we use RF’s implicit ability to detect outliers to remove noisy cases from the data. Then, we generate new synthetic cases by means of the Synthetic Minority Over-sampling Technique (SMOTE). We evaluate the different retrieval components of the framework on a dataset from the footwear domain. The experimental results indicate the benefits of the proposed methodology.
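To make the synthetic-data step more concrete, the following is a minimal sketch of SMOTE-style interpolation, not the implementation used in this work. It assumes plain numeric feature vectors and Euclidean nearest neighbours, and it picks base samples at random rather than iterating over every minority case. The function names `euclidean` and `smote`, the parameters `k` and `seed`, and the toy feature vectors are all hypothetical. The RF-based outlier removal is not shown.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic new cases for an under-represented class.

    Each new case lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base sample among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors for a concept with few annotated images
    few_labelled = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for case in smote(few_labelled, n_synthetic=3, k=2, seed=1):
        print(case)
```

Because every synthetic case is a convex combination of two existing cases, the new data stay inside the region already covered by the annotated samples. This is one reason why removing outliers before over-sampling is a plausible design choice: an outlier used as a base or neighbour would otherwise seed further noisy cases.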
