Evaluation and analysis of term scoring methods for term extraction

We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, boolean query term suggestion, personalized query suggestion and patient query expansion. The term scoring methods proposed in the literature were each designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections whose characteristics differ from those they were designed for, and which method is most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of these methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback–Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
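As an illustration of the last-named method, the following is a minimal sketch of pointwise Kullback–Leibler scoring as it is commonly formulated: informativeness compares a term's probability in the foreground collection with its probability in a background corpus, and phraseness compares an n-gram's probability with the product of its unigram probabilities. The function names, the add-one smoothing of the background model, the whitespace tokenization and the mixing weight `gamma` are assumptions made for this example, not details taken from the text above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def klip_scores(fg_tokens, bg_tokens, max_n=2, gamma=0.5):
    """Score foreground n-grams by pointwise KL divergence.

    gamma weights informativeness against phraseness (hypothetical default).
    """
    fg_unigrams = Counter(fg_tokens)
    fg_total = len(fg_tokens)
    scores = {}
    for n in range(1, max_n + 1):
        fg = Counter(ngrams(fg_tokens, n))
        bg = Counter(ngrams(bg_tokens, n))
        fg_n = sum(fg.values())
        bg_n = sum(bg.values())
        vocab = len(set(fg) | set(bg))
        for term, count in fg.items():
            p_fg = count / fg_n
            # Assumption: add-one smoothing so unseen background terms
            # do not cause a division by zero.
            p_bg = (bg[term] + 1) / (bg_n + vocab)
            # Informativeness: foreground vs. background probability.
            kli = p_fg * math.log(p_fg / p_bg)
            # Phraseness: n-gram probability vs. independent unigrams.
            p_indep = math.prod(fg_unigrams[w] / fg_total
                                for w in term.split(" "))
            klp = p_fg * math.log(p_fg / p_indep)
            scores[term] = gamma * kli + (1 - gamma) * klp
    return scores


if __name__ == "__main__":
    fg = "term extraction term extraction is fun".split()
    bg = "the cat sat on the mat and the dog is fun".split()
    ranked = sorted(klip_scores(fg, bg).items(), key=lambda kv: -kv[1])
    for term, score in ranked[:5]:
        print(f"{score:.3f}  {term}")
```

In this sketch the phraseness component is zero for single-word terms, so `gamma` controls how strongly multi-word terms are favoured over single words.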
