Analyzing Scientific Publications using Domain-Specific Word Embedding and Topic Modelling

The scientific world is changing at a rapid pace, with new technology being developed and new trends being set at an increasing frequency. This paper presents a framework for conducting scientific analyses of academic publications, which is crucial for monitoring research trends and identifying potential innovations. The framework combines several Natural Language Processing techniques, namely word embedding and topic modelling. Word embedding is used to capture the semantic meanings of domain-specific words. We propose two novel scientific publication embeddings, PUB-G and PUB-W, which are capable of learning the semantic meanings of general as well as domain-specific words in various research fields. Thereafter, topic modelling is used to identify clusters of research topics within these larger research fields. We curated a publication dataset consisting of two conferences and two journals from 1995 to 2020, drawn from two research domains. Experimental results show that our PUB-G and PUB-W embeddings outperform the baseline embeddings by a margin of approximately 0.18 to 1.03 in topic coherence.
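The abstract compares embeddings by topic coherence but does not say which coherence measure is used. As an illustration only, the sketch below assumes a UMass-style coherence, which scores a topic by how often its top words co-occur in the same documents. The function name `umass_coherence` and the toy corpus `docs` are hypothetical and are not taken from the paper.

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic (assumed measure, not confirmed by the paper).

    top_words: topic words ordered from most to least probable.
    documents: list of tokenised documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the corpus
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / d_l)  # +1 smoothing avoids log(0)
    return score

# Toy corpus (hypothetical), four tokenised documents.
docs = [
    ["word", "embedding", "vector"],
    ["topic", "model", "lda"],
    ["word", "embedding", "topic"],
    ["word", "vector"],
]

related = umass_coherence(["word", "embedding"], docs)
unrelated = umass_coherence(["word", "lda"], docs)
print(related, unrelated)
```

Under this measure a higher (less negative) score means the topic's words co-occur more often, so the pair that shares documents scores above the pair that never appears together.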
