Paper information - Sources of Evidence for Automatic Indexing of Political Texts

Political texts on the Web, which document laws and policies and the process leading to them, are of key importance to government, industry, and every individual citizen. Yet access to such texts is difficult because of the ever-increasing volume and complexity of the content, prompting the need to index or annotate them with a common controlled vocabulary or ontology. In this paper, we investigate the effectiveness of different sources of evidence (such as the labeled training data, textual glosses of descriptor terms, and the thesaurus structure) for automatically indexing political texts. Our main findings are the following. First, using a learning-to-rank (LTR) approach that integrates all features, we observe significantly better performance than previous systems. Second, the analysis of feature weights reveals the relative importance of the various sources of evidence, and also gives insight into the underlying classification problem. Third, a lean-and-mean system using only four features (text, title, descriptor glosses, descriptor term popularity) performs at 97% of the level of the large LTR model.
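To make the four-feature idea concrete, the following is a minimal sketch of how candidate descriptors could be ranked by a linear combination of feature values. It is an illustration only, not the system described in the paper: the feature names (`text_sim`, `title_sim`, `gloss_sim`, `popularity`), the weights, the descriptor labels, and the feature values are all hypothetical, and a real learning-to-rank model would learn its weights from labeled training data.

```python
from typing import Dict, List, Tuple

# Hypothetical weights for illustration; a learning-to-rank model
# would learn these from labeled training data.
WEIGHTS: Dict[str, float] = {
    "text_sim": 0.5,      # similarity between document text and descriptor
    "title_sim": 0.25,    # similarity between document title and descriptor
    "gloss_sim": 0.125,   # similarity between document and descriptor gloss
    "popularity": 0.125,  # how often the descriptor is assigned overall
}


def score(features: Dict[str, float]) -> float:
    """Linear combination of feature values; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def rank_descriptors(
    candidates: Dict[str, Dict[str, float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k descriptors sorted by descending score."""
    scored = [(desc, score(feats)) for desc, feats in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up feature values for one document and three candidate descriptors.
candidates = {
    "fisheries policy": {
        "text_sim": 0.8, "title_sim": 1.0, "gloss_sim": 0.4, "popularity": 0.2,
    },
    "agricultural subsidy": {
        "text_sim": 0.4, "title_sim": 0.0, "gloss_sim": 0.8, "popularity": 0.8,
    },
    "customs tariff": {
        "text_sim": 0.2, "title_sim": 0.0, "gloss_sim": 0.0, "popularity": 0.4,
    },
}

ranking = rank_descriptors(candidates)
for descriptor, value in ranking:
    print(f"{descriptor}: {value:.3f}")
```

Keeping the scorer linear means each weight can be read directly as the relative contribution of its source of evidence, which is the kind of inspection the abstract refers to when it mentions analyzing feature weights.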
