Specificity is the level of detail at which a given term is represented. Existing approaches to estimating term specificity depend primarily on corpus-level frequency statistics. In this work, we explore how neural embeddings can be used to define corpus-independent specificity metrics. In particular, we propose to measure term specificity based on the distribution of terms in the neighborhood of the given term in the embedding space. The intuition is that a term tightly surrounded by closely related terms in the embedding space is more likely to be specific, while a term whose neighbors are only loosely related is more likely to be generic. On this basis, we leverage geometric properties of embedded terms to define three groups of metrics: (1) neighborhood-based, (2) graph-based, and (3) cluster-based metrics. Moreover, we apply learning-to-rank techniques to estimate term specificity in a supervised manner, using the three proposed groups of metrics as features. We curate and publicly share a test collection of term specificity measurements derived from Wikipedia's category hierarchy. We report on our experiments through metric performance comparisons, an ablation study, and comparison against state-of-the-art baselines.
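To make the neighborhood-based intuition concrete, the following is a minimal sketch of one possible metric of this kind: it scores a term by the mean cosine similarity to its k nearest neighbors in a pretrained embedding space, so that tighter neighborhoods yield higher (more specific) scores. The function name, the choice of mean k-nearest-neighbor similarity, and the toy vocabulary are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def neighborhood_specificity(term, embeddings, k=10):
    # Score a term by the mean cosine similarity to its k nearest
    # neighbors: a tighter neighborhood suggests a more specific term,
    # a looser one a more generic term. (Illustrative metric only.)
    v = embeddings[term]
    v = v / np.linalg.norm(v)
    sims = []
    for other, u in embeddings.items():
        if other == term:
            continue
        sims.append(float(np.dot(v, u / np.linalg.norm(u))))
    # Keep the k most similar neighbors and average their similarities.
    top_k = sorted(sims, reverse=True)[:k]
    return sum(top_k) / len(top_k)

# Toy usage: random vectors stand in for pretrained word embeddings.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["heparin", "aspirin", "drug", "thing"]}
print(neighborhood_specificity("heparin", embeddings, k=2))

In practice the random vectors above would be replaced by vectors from a pretrained embedding model, and the graph- and cluster-based metrics would replace the simple mean-similarity aggregation with measures computed over a k-nearest-neighbor graph or a clustering of the neighborhood.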