A Comparison of Different Strategies for Automated Semantic Document Annotation

We introduce a framework for automated semantic document annotation that is composed of four processes: concept extraction, concept activation, annotation selection, and evaluation. The framework is used to implement and compare different annotation strategies motivated by the literature. For concept extraction, we apply entity detection with semantic hierarchical knowledge bases, Tri-gram, RAKE, and LDA. For concept activation, we compare a set of statistical, hierarchy-based, and graph-based methods. For selecting annotations, we compare top-k and kNN. In total, we define 43 different strategies, including novel combinations such as graph-based activation with kNN. We have evaluated the strategies on three datasets of varying size from three scientific disciplines (economics, politics, and computer science) that contain 100,000 manually labeled documents in total. We obtain the best results on all three datasets with our novel combination of entity detection, graph-based activation (e.g., HITS and Degree), and kNN. For the economics and political science datasets, the best F-measures are .39 and .28, respectively. For the computer science dataset, the best F-measure is .33. These are by far the largest experiments on scholarly content annotation; such experiments typically cover only up to a few hundred documents per dataset.
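The four processes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The toy thesaurus, the function names, and the sample text are all hypothetical, and plain frequency scoring stands in for the statistical activation methods. The hierarchy-based and graph-based activation methods and kNN selection are not shown.

```python
from collections import Counter

# Hypothetical toy thesaurus mapping each concept to its surface labels.
# It is an assumption for illustration, not a knowledge base from the paper.
THESAURUS = {
    "inflation": ["inflation", "price level"],
    "monetary policy": ["monetary policy", "central bank"],
    "unemployment": ["unemployment", "jobless"],
}


def extract_concepts(text):
    """Concept extraction: count label occurrences per concept via string matching."""
    lowered = text.lower()
    counts = Counter()
    for concept, labels in THESAURUS.items():
        hits = sum(lowered.count(label) for label in labels)
        if hits:
            counts[concept] = hits
    return counts


def activate(counts):
    """Concept activation: relative frequency, a simple statistical stand-in."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {concept: n / total for concept, n in counts.items()}


def select_top_k(scores, k):
    """Annotation selection: keep the k highest-scoring concepts (ties broken by name)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, _ in ranked[:k]]


def f_measure(predicted, gold):
    """Evaluation: harmonic mean of precision and recall for one document."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = ("The central bank adjusted monetary policy as inflation rose; "
              "inflation expectations and the price level followed.")
    scores = activate(extract_concepts(sample))
    annotations = select_top_k(scores, k=1)
    print(annotations, f_measure(annotations, ["inflation", "monetary policy"]))
```

Keeping each process behind its own function mirrors the framework's idea that strategies are combinations: swapping `activate` or `select_top_k` for another method yields a different strategy without touching the other steps.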
