Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis

Abstract: Blockchain has become one of the core technologies of Industry 4.0. To help decision-makers establish blockchain-based action plans, analyzing trends in blockchain technology is an urgent task. However, most existing studies on blockchain trend analysis rely either on effort-demanding full-text investigation or on traditional bibliometric methods whose scope is limited to frequency-based statistical analysis. In this paper, we therefore propose a new topic modeling method called Word2vec-based Latent Semantic Analysis (W2V-LSA), which combines Word2vec and spherical k-means clustering to better capture and represent the context of a corpus. We then used W2V-LSA to perform an annual trend analysis of blockchain research, by country and over time, on 231 abstracts of blockchain-related papers published over the past five years. The performance of the proposed algorithm was compared with that of Probabilistic LSA, a common topic modeling technique. Quantitative and qualitative evaluation of the experimental results confirmed the usefulness of W2V-LSA in terms of topic accuracy and diversity. The proposed method can be a competitive alternative for topic modeling that helps set directions for future research in technology trend analysis, and it is applicable to various expert systems related to text mining.
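To make the clustering step concrete, the following is a minimal sketch of spherical k-means applied to word vectors. It is not the authors' implementation: the abstract does not give the details of W2V-LSA, so the function name `spherical_kmeans`, its parameters, and the two-dimensional toy vectors are all illustrative assumptions. In practice the vectors would come from a Word2vec model trained on the corpus.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster the rows of X by cosine similarity (hypothetical helper).

    Returns (labels, centroids), where every centroid has unit length.
    """
    rng = np.random.default_rng(seed)
    # Project each vector onto the unit sphere so that a dot product
    # between two rows equals their cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialise the centroids with k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assign each vector to the centroid with the highest cosine similarity.
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # Re-normalise so the centroid stays on the unit sphere.
                centroids[j] = total / norm
    return labels, centroids

# Toy "word embeddings": made-up 2-D vectors, NOT real Word2vec output.
words = ["bitcoin", "currency", "payment", "patient", "health", "record"]
vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.0],   # finance-like direction
    [0.1, 1.0], [0.2, 0.9], [0.0, 0.8],   # healthcare-like direction
])

labels, centroids = spherical_kmeans(vectors, k=2)

# Describe each cluster ("topic") by its member words, ranked by cosine
# similarity to the cluster centroid.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
topics = {}
for j in range(2):
    idx = np.where(labels == j)[0]
    ranked = idx[np.argsort(-(unit[idx] @ centroids[j]))]
    topics[j] = [words[i] for i in ranked]
print(topics)
```

Normalising both the vectors and the centroids is what distinguishes the spherical variant from ordinary k-means: cluster membership depends only on the direction of a word vector, not on its length, which matches the cosine-based notion of similarity commonly used with word embeddings.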
