A generalized topic modeling approach for automatic document annotation

Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify hypotheses. Over time, such data have increased not only in amount but also in the diversity and heterogeneity of their sources, which are spread throughout the world. This heterogeneity poses a major challenge for scientists, who have to search manually for the data they need. ONEMercury has recently been implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can impede effective retrieval. We propose a methodology that learns annotations from well-annotated collections of metadata records in order to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss related topics, such as using topical coherence to fine-tune parameters, and report experiments on cross-archive annotation.
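The abstract frames annotation as tag recommendation over a controlled tag library learned from well-annotated records, but does not spell out the two algorithm variants. The sketch below is therefore not the authors' method. It is a minimal stand-in that uses TF-IDF cosine similarity: each well-annotated record votes for its own tags, weighted by its similarity to the poorly annotated record. All names (`tokenize`, `tfidf_vectors`, `recommend_tags`, `ANNOTATED`) and the toy records are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy "well-annotated" collection: (record text, tags from the controlled library).
# These records are invented for illustration, not taken from any real archive.
ANNOTATED = [
    ("soil carbon flux measured in temperate forest plots", ["carbon", "soil", "forest"]),
    ("stream water temperature and dissolved oxygen observations", ["water quality", "stream"]),
    ("forest canopy leaf area and carbon uptake", ["carbon", "forest", "vegetation"]),
]


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for tokenized docs, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that terms occurring in every record keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_tags(query_text, annotated, top_k=3):
    """Rank library tags by similarity-weighted votes from annotated records."""
    vectors, idf = tfidf_vectors([tokenize(text) for text, _ in annotated])
    # Terms unseen in the annotated collection carry no evidence and are dropped.
    query = {t: tf * idf[t]
             for t, tf in Counter(tokenize(query_text)).items() if t in idf}
    scores = defaultdict(float)
    for vec, (_, tags) in zip(vectors, annotated):
        sim = cosine(query, vec)
        for tag in tags:
            scores[tag] += sim
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, score in ranked[:top_k] if score > 0]


if __name__ == "__main__":
    print(recommend_tags("carbon storage in forest soil", ANNOTATED))
```

Because recommendations are drawn only from tags already attached to the annotated records, the output stays within the controlled tag library. A topic-model-based variant, as the title suggests, would presumably replace the TF-IDF vectors with per-document topic distributions, but that substitution is an assumption rather than something the abstract states.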
