Back to the Basics: A Quantitative Analysis of Statistical and Graph-Based Term Weighting Schemes for Keyword Extraction

Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, relatively few evaluation studies shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf by default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings, such as the advantages of the lesser-known lexical specificity over tf-idf and the qualitative differences between statistical and graph-based methods. Finally, based on our findings, we discuss and offer some suggestions for practitioners. We release our code at this https URL.
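To make the tf-idf baseline mentioned above concrete, the following is a minimal sketch of tf-idf-based keyword ranking. It is an illustration only, not the code released with the paper: the function name `tfidf_keywords`, the toy documents, and the specific variant used (length-normalized term frequency times the natural log of N/df, with no smoothing) are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by tf-idf against the whole collection.

    Hypothetical helper for illustration; `docs` is a list of token lists.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    counts = Counter(docs[doc_index])
    total = len(docs[doc_index])
    # tf = relative frequency in the document, idf = log(N / df).
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy collection (assumed data, whitespace-tokenized).
docs = [
    "graph based ranking of terms in a graph".split(),
    "statistical ranking of terms".split(),
    "evaluation of keyword extraction".split(),
]
top = tfidf_keywords(docs, 0, top_k=2)
```

In this toy collection, a term that occurs in every document (here "of") receives a score of zero, while a term repeated in only one document (here "graph") ranks first. This is the behaviour that makes tf-idf a common default for keyword extraction.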
