TFIDF based Feature Words Extraction and Topic Modeling for Short Text

This paper applies feature word extraction and topic modeling, based on Term Frequency times Inverse Document Frequency (TFIDF) and Latent Dirichlet Allocation (LDA), to the short title text of research supported by the National Institutes of Health (NIH). After the raw text is preprocessed, the distinct terms extracted from the titles of NIH-supported research form the Bag of Words. TFIDF is then used to re-weight the count features, reducing the influence of frequent but less valuable terms and increasing the influence of rarer but more valuable terms. Topic modeling with LDA extracts ten topics, each characterized by its three corresponding feature words. The approach thus extracts the topics of NIH-supported research from titles alone and reveals the most probable research areas.
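The TFIDF re-weighting step can be illustrated with a minimal sketch. It assumes the common variant tf × log(N / df); the abstract does not state which TFIDF variant or preprocessing the authors used, and the titles and the helper name `tfidf` below are hypothetical, chosen only for illustration. The LDA topic modeling step is not shown.

```python
import math
from collections import Counter

# Hypothetical short titles, for illustration only (not taken from the NIH data).
titles = [
    "gene expression in cancer cells",
    "cancer immunotherapy clinical trial",
    "gene regulation in neural cells",
    "clinical trial of cancer vaccine",
]

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Naive preprocessing (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # Bag-of-Words counts for this title
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}
        )
    return weights

weights = tfidf(titles)

# Rank the terms of the first title by weight and keep the top three.
top_terms = sorted(weights[0], key=weights[0].get, reverse=True)[:3]
print(top_terms)
```

In this toy corpus "cancer" occurs in three of the four titles, so its weight in the first title falls below that of "expression", which occurs in only one. This is the effect the abstract describes: frequent terms are down-weighted and rarer terms are emphasized.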
