BioTopic: a topic-driven biological literature mining system

Biology and biomedicine are flourishing disciplines, with massive amounts of biological data produced in experiments and a huge number of research papers published in journals. In such a big data context, unsupervised data mining methods such as topic models are used to extract topics from large-scale document collections. In this paper, we present BioTopic, a biological literature mining system based on topic modelling. Experiments show that the perplexity reduction percentage of our pre-processing method is 5% larger than that of a traditional pre-processing method. The precision of our search reaches 86%, which is better than that of a unigram language model. Our method employs linguistic information from shallow parsing to better pre-process biological literature for topic models. BioTopic, with fine-grained pre-processing and topic modelling, works better than traditional literature mining systems.
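
The perplexity comparison mentioned above can be illustrated with a minimal sketch. It assumes the standard definition of perplexity as the exponential of the negative mean per-token log-likelihood, and it assumes that "reduction percentage" means the relative drop from a baseline perplexity; the abstract does not spell out either definition. The function names and the toy log-probabilities are hypothetical and are not taken from BioTopic.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Assumed definition: exp(-(1/N) * sum(log p(w_i))).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def reduction_percentage(baseline_ppl, improved_ppl):
    """Relative perplexity drop in percent (an assumed reading of the metric)."""
    return 100.0 * (baseline_ppl - improved_ppl) / baseline_ppl


# Toy held-out data: every token gets the same probability under each model.
baseline_log_probs = [math.log(1 / 50)] * 10  # hypothetical baseline model
improved_log_probs = [math.log(1 / 40)] * 10  # hypothetical improved model

baseline_ppl = perplexity(baseline_log_probs)
improved_ppl = perplexity(improved_log_probs)
reduction = reduction_percentage(baseline_ppl, improved_ppl)
```

Lower perplexity means the model assigns higher probability to held-out text, so a larger reduction percentage indicates that a pre-processing variant helps the topic model more under these assumptions.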
