Algorithm for Bengali Keyword Extraction

We present an algorithm for keyword extraction from Bengali documents. In natural language processing (NLP), keyword extraction is the automated process of identifying a set of terms that represent the information discussed in a document. Considerable research has been done on keyword extraction for resource-rich languages. Some of this work followed a supervised approach using a specific corpus, whereas the latest techniques use an unsupervised approach. Keyword extraction has already achieved state-of-the-art performance for resource-rich languages. Only a few studies have addressed keyword extraction for Bengali documents, and none of them achieved more than 70% accuracy. In this article, we discuss methods for extracting Bengali keywords from a specific document collection following an unsupervised learning approach. Bengali keyword extraction is generally difficult because it depends on word parsing, stemming, stop-word exclusion, and similar preprocessing steps. The accuracy of those modules also affects the performance of the keyword extraction procedure. Nevertheless, we obtained 87% accuracy in identifying the correct Bengali keywords in a document. The procedure we discuss for keyword extraction can also be applied to any language; here, however, we report all of our experimental results specifically for Bengali.
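
To make the preprocessing steps named above concrete, the following is a minimal sketch of an unsupervised keyword extractor. It is not the algorithm evaluated in this article: it swaps in plain TF-IDF scoring as a stand-in, omits stemming entirely, and uses a tiny hypothetical stop-word list. The names `STOP_WORDS`, `tokenize`, and `extract_keywords` are illustrative assumptions, as is the choice of a smoothed inverse document frequency.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"এবং", "ও", "এই", "একটি", "থেকে", "করে", "হয়"}

# A token is a run of characters from the Bengali Unicode block (U+0980 to U+09FF).
# The danda (U+0964) lies outside this block, so it acts as a separator.
TOKEN_RE = re.compile(r"[\u0980-\u09FF]+")


def tokenize(text):
    """Split text into Bengali tokens and drop stop words (no stemming)."""
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOP_WORDS]


def extract_keywords(doc, corpus, top_k=5):
    """Rank the terms of `doc` by TF-IDF computed against `corpus`."""
    doc_tokens = tokenize(doc)
    term_counts = Counter(doc_tokens)
    n_docs = len(corpus)
    corpus_token_sets = [set(tokenize(d)) for d in corpus]

    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for s in corpus_token_sets if term in s)
        # Smoothed IDF (an assumption of this sketch) avoids division by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


if __name__ == "__main__":
    corpus = [
        "ঢাকা বাংলাদেশের রাজধানী এবং ঢাকা একটি বড় শহর",
        "চট্টগ্রাম একটি বন্দর শহর",
        "নদী থেকে মাছ ধরা হয়",
    ]
    print(extract_keywords(corpus[0], corpus, top_k=3))
```

A practical system would replace the stop-word set with a curated list and add a Bengali stemmer before counting, since the accuracy of those modules affects the final keyword quality.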
