SAO2Vec: Development of an algorithm for embedding the subject–action–object (SAO) structure using Doc2Vec

In natural-language processing, the subject–action–object (SAO) structure is used to convert unstructured text into structured data comprising subjects, actions, and objects. This structure is well suited to analyzing the key elements of a technology and the relationships between them. However, analysis with the existing SAO structure requires substantial manual effort because the structure does not represent the context of the sentences. We therefore introduce SAO2Vec, in which SAO structures are used to embed sentences and documents as vectors for text mining of technical documents. First, the technical documents of interest are collected, and SAO structures are extracted from them. Then, sentence vectors are obtained with the Doc2Vec algorithm and updated using the vectors of the words in each SAO structure. Finally, the vector of each SAO structure is derived from the updated vectors of the sentences that share that structure, and document vectors are derived from each document’s SAO vectors. In an experiment in the Internet of Things field, SAO2Vec achieved 3.1% better accuracy than Doc2Vec and 115.0% better accuracy than SAO frequency alone. These results indicate that the proposed SAO2Vec algorithm can improve grouping and similarity analysis by capturing both the meanings and the contexts of technical elements.
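
The three vector steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the update or aggregation formulas, so everything here is an assumption made for illustration only: the sentence update is a weighted blend controlled by a hypothetical `alpha` parameter, SAO and document vectors are plain means, and the toy 2-dimensional vectors stand in for vectors that would come from a trained Doc2Vec model. The function names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

import numpy as np


def update_sentence_vector(sent_vec, sao, word_vecs, alpha=0.5):
    """Blend a sentence vector with the mean of its SAO word vectors.

    ASSUMPTION: the weighted average and `alpha` are illustrative;
    the abstract does not specify the update rule.
    """
    sao_mean = np.mean([word_vecs[w] for w in sao], axis=0)
    return (1.0 - alpha) * sent_vec + alpha * sao_mean


def sao_vectors(sentences, word_vecs, alpha=0.5):
    """Average the updated vectors of sentences sharing the same SAO triple.

    `sentences` is a list of (sao_triple, sentence_vector) pairs.
    ASSUMPTION: a plain mean is used as the aggregation.
    """
    groups = defaultdict(list)
    for sao, vec in sentences:
        groups[sao].append(update_sentence_vector(vec, sao, word_vecs, alpha))
    return {sao: np.mean(vecs, axis=0) for sao, vecs in groups.items()}


def document_vector(doc_saos, sao_vecs):
    """Derive a document vector as the mean of its SAO vectors (assumed)."""
    return np.mean([sao_vecs[s] for s in doc_saos], axis=0)


# Toy inputs standing in for Doc2Vec word and sentence vectors.
word_vecs = {
    "sensor": np.array([1.0, 0.0]),
    "measure": np.array([0.0, 1.0]),
    "temperature": np.array([1.0, 1.0]),
    "gateway": np.array([1.0, 0.0]),
    "send": np.array([1.0, 0.0]),
    "data": np.array([1.0, 0.0]),
}
sao_a = ("sensor", "measure", "temperature")
sao_b = ("gateway", "send", "data")
sentences = [
    (sao_a, np.array([0.0, 0.0])),
    (sao_a, np.array([2.0, 2.0])),
    (sao_b, np.array([3.0, 0.0])),
]

sao_vecs = sao_vectors(sentences, word_vecs)
doc_vec = document_vector([sao_a, sao_b], sao_vecs)
```

Under these assumptions, two sentences that express the same SAO triple contribute to a single SAO vector, so documents can then be compared through their `document_vector` outputs (for example, with cosine similarity) rather than through raw SAO frequency counts.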
