Measuring Term Informativeness in Context

Measuring term informativeness is a fundamental NLP task. Existing methods, mostly based on statistical information in corpora, do not actually measure the informativeness of a term with regard to its semantic context. This paper proposes a new lightweight, feature-free approach to encode term informativeness in context by leveraging web knowledge. Given a term and its context, we model context-aware term informativeness based on the semantic similarity between the context and the term's most featured context in a knowledge base, Wikipedia. We apply our method to three applications: core term extraction from snippets (text segment), scientific keyword extraction (paper), and back-of-the-book index generation (book). The performance is state-of-the-art or close to it for each application, demonstrating the method's effectiveness and generality.
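
To make the core idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the "most featured context" of a term is available as a plain string (here a hypothetical, hand-written `FEATURED` dictionary standing in for Wikipedia) and that semantic similarity can be approximated by bag-of-words cosine similarity. The abstract does not specify the similarity measure or the retrieval step, so both are placeholders, and the function names are illustrative.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def informativeness(term, context, featured_contexts):
    """Score a term by how similar its local context is to its featured context.

    `featured_contexts` is a hypothetical stand-in for a knowledge-base lookup.
    Terms with no entry get a score of 0.0.
    """
    featured = featured_contexts.get(term)
    if featured is None:
        return 0.0
    return cosine(bag_of_words(context), bag_of_words(featured))


# Hypothetical featured context; a real system would derive this from Wikipedia.
FEATURED = {
    "kernel": "operating system kernel manages memory processes and hardware",
}

on_topic = informativeness(
    "kernel", "the kernel schedules processes and manages memory", FEATURED
)
off_topic = informativeness(
    "kernel", "remove the kernel from the corn cob before cooking", FEATURED
)
print(on_topic > off_topic)
```

In this toy setting the same term scores higher in the context that resembles its featured context, which is the behaviour the abstract describes as context-aware informativeness. A corpus-level statistic such as IDF would assign the term a single score regardless of context.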
