A Big Data architecture for knowledge discovery in PubMed articles

The need for smart information retrieval systems is at odds with the difficulty of handling huge amounts of data. In this paper we present a Big Data Analytics architecture used to implement a semantic similarity search tool for natural language texts in the biomedical domain. The implemented methodology is based on Word Embedding (WE) models obtained using the word2vec algorithm. The system has been assessed with documents extracted from the whole PubMed library. We also present a user-friendly web front-end that allows the methodology to be assessed in a real context.
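As a rough illustration of what embedding-based semantic similarity search can look like, the sketch below ranks documents against a query by cosine similarity between averaged word vectors. It is not the architecture described in the paper: the three-dimensional vectors in `TOY_VECTORS` are made up for the example (a real system would load vectors trained with word2vec), averaging word vectors is only one common way to represent a text, and the names `embed_text`, `cosine`, and `rank_documents` are hypothetical.

```python
import math

# Hypothetical toy embeddings; a real system would use vectors trained with word2vec.
TOY_VECTORS = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "pain": [0.7, 0.3, 0.0],
    "genome": [0.0, 0.2, 0.9],
    "sequencing": [0.1, 0.1, 0.8],
}
DIM = 3


def embed_text(text, vectors=TOY_VECTORS):
    """Average the vectors of the known tokens (assumed text representation)."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        return [0.0] * DIM  # no known token: fall back to the zero vector
    return [sum(component) / len(known) for component in zip(*known)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    query_vec = embed_text(query)
    scored = [(doc_id, cosine(query_vec, embed_text(text)))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {"doc_a": "ibuprofen pain", "doc_b": "genome sequencing"}
    for doc_id, score in rank_documents("aspirin", docs):
        print(doc_id, round(score, 3))
```

With these toy vectors, the query "aspirin" ranks the document about ibuprofen and pain above the one about genome sequencing, even though the query term appears in neither text. That is the behaviour a purely keyword-based search would miss.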
