A two-step approach toward subject prediction
Automatic subject prediction is a desirable feature for modern digital library systems, as manual indexing can no longer cope with the rapid growth of digital collections. Data sparsity and model scalability are the major challenges in solving this extreme multi-label classification problem automatically. In this research-in-progress paper, we propose to address the problem with a two-step approach. First, we propose an efficient and effective embedding method that embeds terms, subjects and documents into the same semantic space, where similarity can be computed easily. Second, we describe a novel Non-Parametric Subject Prediction (NPSP) method and show how effectively it predicts even very specialised subjects, which are associated with few documents in the training set and are therefore more problematic for a classifier.
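To make the idea of a shared semantic space concrete, the following is a minimal sketch of ranking subjects by their similarity to a document. It is not the NPSP method itself, whose details are not given here. It assumes cosine similarity as the similarity measure, and the function names (`cosine`, `rank_subjects`), the subject labels and the three-dimensional vectors are all hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_subjects(doc_vec, subject_vecs, top_k=2):
    """Return the top_k (subject, score) pairs, most similar to the document first."""
    scored = [(name, cosine(doc_vec, vec)) for name, vec in subject_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, hand-made embeddings (hypothetical values, not taken from the paper).
subject_vecs = {
    "machine learning": [0.9, 0.1, 0.0],
    "library science": [0.1, 0.9, 0.1],
    "organic chemistry": [0.0, 0.1, 0.9],
}
doc_vec = [0.6, 0.7, 0.05]

ranked = rank_subjects(doc_vec, subject_vecs)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Because no per-subject classifier is trained in this sketch, a subject with very few training documents can still be ranked as long as it has an embedding. This is one plausible reading of why a non-parametric approach may help with specialised subjects.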