TOP-Rank: A Novel Unsupervised Approach for Topic Prediction Using Keyphrase Extraction for Urdu Documents

In Natural Language Processing (NLP), topic modeling is a technique for extracting abstract information from documents containing large amounts of text. This abstract information leads to the identification of the topics in a document. One way to retrieve topics from documents is keyphrase extraction. Keyphrases are sets of terms that give a high-level description of a document. Different keyphrase extraction techniques for topic prediction have been proposed for multiple languages, e.g., English and Arabic. However, this area remains to be explored for other languages, such as Urdu. Therefore, this paper introduces a novel unsupervised approach to topic prediction for the Urdu language that is able to extract more significant information from documents. The proposed TOP-Rank system extracts keywords from the document and ranks them according to their positions in a sentence. These keywords, along with their ranking scores, are used to generate keyphrases by applying syntactic rules, in order to extract more meaningful topics. These keyphrases are ranked according to the scores of their keywords and re-ranked with respect to their positions in the document. Finally, the proposed model treats the top-ranked keyphrases as topically significant and selects the keyphrase with the highest score as the topic of the document. Experiments are performed on two different datasets, and the performance of the proposed system is compared with existing state-of-the-art techniques. The results show that the proposed system outperforms existing techniques and can produce more meaningful topics.
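
The pipeline outlined in the abstract (position-based keyword scoring, keyphrase generation, keyphrase ranking, and position-based re-ranking) can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas or syntactic rules, so everything concrete here is an assumption made for illustration: the `1/position` keyword weight, the use of maximal runs of non-stopwords in place of syntactic rules, the `1/(1 + sentence index)` re-ranking factor, the tiny English stopword list, and all function names. It should not be read as the TOP-Rank implementation.

```python
import re
from collections import defaultdict

# Hypothetical toy stopword list; a real system would need an Urdu list.
STOPWORDS = {"the", "of", "a", "is", "in", "for", "and", "are", "on"}


def tokenize(document):
    """Split into sentences on '.' or the Urdu full stop, then into lowercase tokens."""
    sentences = [s.strip() for s in re.split(r"[.\u06d4]", document) if s.strip()]
    return [s.lower().split() for s in sentences]


def keyword_scores(sentences):
    """Score each non-stopword by its positions in sentences.

    Assumption: each occurrence contributes 1/position (1-based), so words
    appearing earlier in a sentence score higher.
    """
    scores = defaultdict(float)
    for tokens in sentences:
        for pos, tok in enumerate(tokens, start=1):
            if tok not in STOPWORDS:
                scores[tok] += 1.0 / pos
    return dict(scores)


def candidate_phrases(sentences):
    """Yield (phrase, sentence_index) for each maximal run of non-stopwords.

    Assumption: this stands in for the syntactic rules mentioned in the abstract.
    """
    for idx, tokens in enumerate(sentences):
        run = []
        for tok in tokens + [None]:  # the None sentinel flushes the last run
            if tok is not None and tok not in STOPWORDS:
                run.append(tok)
            elif run:
                yield tuple(run), idx
                run = []


def rank_keyphrases(document):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    sentences = tokenize(document)
    kw = keyword_scores(sentences)

    # Remember the first sentence in which each candidate phrase occurs.
    first_seen = {}
    for phrase, idx in candidate_phrases(sentences):
        first_seen.setdefault(phrase, idx)

    ranked = []
    for phrase, idx in first_seen.items():
        base = sum(kw[w] for w in phrase)  # rank by keyword scores
        ranked.append((" ".join(phrase), base / (1 + idx)))  # re-rank by position
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


def predict_topic(document):
    """Return the top-ranked keyphrase, or None for an empty document."""
    ranked = rank_keyphrases(document)
    return ranked[0][0] if ranked else None


if __name__ == "__main__":
    text = "Topic modeling is useful. The topic modeling of documents is hard."
    print(predict_topic(text))
```

With these assumed weights, a phrase scores well when its words tend to occur early in sentences and when the phrase itself first appears early in the document; other weighting schemes would fit the abstract's description equally well.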
