Multi-label classification and interactive NLP-based visualization of electric vehicle patent data

Abstract: The objectives of this study are to (1) interactively visualize information embedded in patent texts, and (2) train a high-accuracy multi-label classification algorithm capable of assigning patents to multiple Cooperative Patent Classification (CPC) classes. The case study involved metadata and text data of 17,500 electric vehicle patents. To these ends, the following methodology was applied. First, feature engineering was based on topic extraction from patent texts using latent Dirichlet allocation (LDA) and the perplexity metric. Second, multi-label implementations of the random forest, decision tree, and k-nearest neighbors (KNN) algorithms were trained on the data in order to predict the multiple class labels corresponding to a given electric vehicle patent. The results of this study were promising, with the best scores for performance metrics such as accuracy, precision, recall, F-score, and Hamming loss being 0.91, 0.92, 0.74, and 0.02 respectively. The implications of our results are two-fold. First, we demonstrate the effectiveness of open-source tools for customized patent analysis pipelines that include interactive data visualization and machine learning. Second, our results provide a strong basis for automated multi-label classification of patents into CPC classes.
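As an illustration of the Hamming loss reported above, the following is a minimal sketch of how this metric, together with exact-match (subset) accuracy, can be computed for multi-label predictions. The abstract does not state how the metrics were computed, so the function names, the CPC class codes used as column labels, and the toy label matrices are all assumptions made for illustration; they are not data or code from the study.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]  # one row per patent, one 0/1 entry per class


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual (patent, class) slots that are predicted wrongly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each row must have one entry per class")
        wrong += sum(1 for t, p in zip(true_row, pred_row) if t != p)
        total += len(true_row)
    return wrong / total


def subset_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of patents whose full label set is predicted exactly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Illustrative column labels only (assumed, not taken from the study).
CPC_CLASSES = ["B60L", "H01M", "H02J", "B60K"]

# Toy ground truth and predictions for three hypothetical patents.
y_true = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 1]]

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Subset accuracy:", subset_accuracy(y_true, y_pred))
```

In this toy example two of the twelve label slots are wrong, while only one of the three patents has its full label set predicted exactly. This shows why a low Hamming loss can coexist with a noticeably lower exact-match accuracy in multi-label settings.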
