Automatic tag recommendation for metadata annotation using probabilistic topic modeling

Growing complexity and rapid advancement in the ecological and environmental sciences encourage scientists worldwide to collect data across multiple places, times, and thematic scales to verify their hypotheses. Accumulated over time, such data grows not only in volume but also in the diversity of its sources, which are spread around the world. This poses a major challenge for scientists, who must search for information manually. To alleviate this problem, ONEMercury was recently implemented as part of the DataONE project to serve as a portal for accessing environmental and observational data from across the globe. ONEMercury harvests metadata from data hosted by multiple repositories and makes it searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can hinder effective retrieval. Here, we develop algorithms for the automatic annotation of metadata. We cast the task as a tag recommendation problem with a controlled tag library and propose two variants of an algorithm for recommending tags. Our experiments on four datasets of environmental science metadata records not only show promising performance for our method, but also shed light on the differing natures of the datasets.
