The Keyword Extraction of Chinese Medical Web Page Based on WF-TF-IDF Algorithm

Web page keyword extraction is widely used in web text classification, text clustering, and information retrieval. However, keyword extraction for Chinese web pages still needs improvement and wider application, especially in the medical field. This paper proposes WF-TF-IDF, an improved TF-IDF algorithm, to extract keywords from Chinese medical web pages. The WF-TF-IDF algorithm considers three factors: word frequency in the title, word frequency in the description, and the distribution of words across categories in the corpus. We first preprocess the data, which includes web page denoising, regular expression processing, Chinese word segmentation, synonym replacement, and stop word filtering. We then extract keywords from the preprocessed text and filter out meaningless words among the extracted keywords according to their part of speech. The experimental results show that the WF-TF-IDF algorithm improves precision and recall by about 7% compared to the traditional TF-IDF algorithm.
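
The abstract does not give the exact WF-TF-IDF formula, so the following Python sketch only illustrates how the three factors could be combined. The function names, the title and description weights (3.0 and 2.0), the smoothed IDF, and the category concentration factor are all assumptions made for illustration, not the paper's definitions. Inputs are assumed to be token lists that have already been segmented and filtered.

```python
import math
from collections import Counter


def wf_tf_idf_scores(title, description, body, doc_freq, cat_doc_freq, n_docs,
                     title_weight=3.0, desc_weight=2.0):
    """Score tokens of one page. All weights and formulas are hypothetical.

    title, description, body: lists of already-segmented tokens.
    doc_freq: token -> number of corpus documents containing it.
    cat_doc_freq: token -> {category: number of documents in that category}.
    n_docs: total number of documents in the corpus.
    """
    # Weighted frequency: each body occurrence counts 1; title and
    # description occurrences add extra (assumed) weight.
    weighted = Counter(body)
    for tok in title:
        weighted[tok] += title_weight
    for tok in description:
        weighted[tok] += desc_weight
    total = sum(weighted.values())

    scores = {}
    for tok, wf in weighted.items():
        tf = wf / total
        # Smoothed IDF (an assumed variant, not taken from the paper).
        idf = math.log((n_docs + 1) / (doc_freq.get(tok, 0) + 1)) + 1.0
        # Category factor (assumed): share of the token's documents that
        # fall in its most frequent category; 1.0 when no data is known.
        per_cat = cat_doc_freq.get(tok, {})
        cat_total = sum(per_cat.values())
        concentration = max(per_cat.values()) / cat_total if cat_total else 1.0
        scores[tok] = tf * idf * concentration
    return scores


def top_keywords(scores, k=3):
    """Return the k highest-scoring tokens, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]
```

Under these assumptions, a word that appears in the title and description and is concentrated in one category outranks a word that is equally frequent in the body but spread evenly across categories.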
