Text Keyword Extraction Based on Multi-dimensional Features

Keyword extraction is a fundamental task in text mining, so extracting high-quality keywords is of great significance. Typical keyword extraction algorithms rely on statistical features but lack semantic information, while supervised keyword extraction algorithms depend heavily on labeled samples. This paper therefore proposes MDFKE, an unsupervised keyword extraction algorithm based on multi-dimensional features that combines statistical features, features drawn from an external knowledge base, and semantic features. MDFKE focuses on the semantic information of candidate keywords. An LDA model is used to obtain the topic of the text, and Word2vec word embeddings are used to generate word vectors. From these, the similarity between each candidate keyword and the text topic is quantified as a semantic feature. In total, nine specific features are extracted from five aspects: term frequency, length, position, external knowledge base, and semantics. Finally, the candidates' feature vectors are clustered to obtain the final keyword set. Experiments show that, compared with traditional keyword extraction algorithms based on statistical features, MDFKE significantly improves extraction performance while avoiding the heavy reliance of supervised learning on labeled data.
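As a rough illustration of the semantic feature described above, the sketch below scores each candidate keyword by its cosine similarity to a topic vector. The abstract does not give MDFKE's exact formula, so every detail here is an assumption: the topic vector is taken as the weight-averaged embedding of the topic's top words, cosine similarity is the similarity measure, and the embeddings and topic weights are hand-made toy values standing in for real Word2vec and LDA output. The names `topic_vector` and `semantic_feature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_vector(topic_words, embeddings):
    """Assumed topic representation: weighted average of the topic words' vectors.

    topic_words is a list of (word, weight) pairs, as an LDA model might
    return for one topic. Words without an embedding are skipped.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in topic_words:
        vec = embeddings.get(word)
        if vec is None:
            continue
        total = [t + weight * x for t, x in zip(total, vec)]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


def semantic_feature(candidate, topic_vec, embeddings):
    """Similarity between a candidate keyword and the topic; 0.0 if out of vocabulary."""
    vec = embeddings.get(candidate)
    return 0.0 if vec is None else cosine(vec, topic_vec)


# Toy 3-dimensional embeddings (assumed values, not real Word2vec output).
EMBEDDINGS = {
    "keyword": [0.9, 0.1, 0.0],
    "extraction": [0.8, 0.2, 0.1],
    "topic": [0.7, 0.3, 0.0],
    "weather": [0.0, 0.1, 0.9],
}
# Toy topic-word distribution (assumed values, not real LDA output).
TOPIC_WORDS = [("keyword", 0.5), ("extraction", 0.3), ("topic", 0.2)]

topic_vec = topic_vector(TOPIC_WORDS, EMBEDDINGS)
scores = {word: semantic_feature(word, topic_vec, EMBEDDINGS) for word in EMBEDDINGS}

for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

In this toy setup, words close to the topic words receive scores near 1 and the unrelated word "weather" scores near 0. In a fuller pipeline, this score would be one entry in each candidate's feature vector, alongside the statistical and knowledge-base features, before clustering.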
