Are Abstracts Enough for Hypothesis Generation?

The potential for automatic hypothesis generation (HG) systems to improve research productivity keeps pace with the growing set of publicly available scientific information. But as data becomes easier to acquire, we must understand the effect that different textual data sources have on the resulting hypotheses. Are abstracts enough for HG, or does it need full-text papers? How many papers does an HG system need to make valuable predictions? How sensitive is a general-purpose HG system to hyperparameter values or input quality? What effects do corpus size and document length have on HG results? To answer these questions, we train multiple versions of a knowledge network-based HG system, MOLIERE, on varying corpora in order to compare challenges and trade-offs in terms of result quality and computational requirements. MOLIERE generalizes the main principles of similar knowledge network-based HG systems and reinforces them with topic modeling components. The corpora include the abstract and full-text versions of PubMed Central, as well as iterative halves of MEDLINE, which allows us to compare the effects that document length and document count have on the results. We find that, quantitatively, corpora with a higher median document length yield marginally higher-quality results, yet take substantially longer to process. However, qualitatively, full-length papers introduce a significant number of intruder terms to the resulting topics, which decreases human interpretability. Additionally, we find that the effect of document length is greater than that of document count, even if both sets contain only paper abstracts. Reproducibility: Our code and data are available online at sybrandt.com/2018/abstracts.
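The corpus construction described above can be illustrated with a minimal sketch. It assumes that "iterative halves" means repeatedly taking a random half of the previous corpus, so that each smaller corpus is nested in the larger one; the abstract does not state the sampling procedure, so this is an assumption. The function names `iterative_halves` and `median_length` and the toy documents are hypothetical and are not taken from the MOLIERE code base.

```python
import random
from statistics import median


def iterative_halves(docs, rounds, seed=0):
    """Return [full corpus, 1/2, 1/4, ...].

    Assumption: each corpus is a uniform random half of the previous one.
    """
    rng = random.Random(seed)
    corpora = [list(docs)]
    for _ in range(rounds):
        prev = corpora[-1]
        # Sample without replacement so each corpus is a subset of the last.
        corpora.append(rng.sample(prev, len(prev) // 2))
    return corpora


def median_length(docs):
    """Median document length, measured in whitespace-separated tokens."""
    return median(len(d.split()) for d in docs)


# Toy stand-in for a corpus: 16 documents with 1 to 16 tokens each.
toy_corpus = [" ".join(["word"] * n) for n in range(1, 17)]
halves = iterative_halves(toy_corpus, rounds=3)
sizes = [len(c) for c in halves]
medians = [median_length(c) for c in halves]
```

Halving in this way varies document count while drawing from the same length distribution, whereas switching between abstract and full-text versions of the same papers varies document length; reporting `sizes` and `medians` for each corpus makes the two factors easy to tell apart.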
