Semantic Search-by-Examples for Scientific Topic Corpus Expansion in Digital Libraries

In this article we address the problem of expanding the set of papers that researchers encounter when conducting bibliographic research on their scientific work. With classical search engines or recommender systems in digital libraries, interesting and relevant articles can be missed if they do not contain the search key-phrases that the researcher is aware of. We propose a novel model based on supervised active learning over a semantic feature transformation of all articles in a given digital library. Our model, named Semantic Search-by-Examples (SSbE), achieves better evaluation results than the More-Like-This query, an existing method with a similar purpose, based on the feedback annotations of two domain experts in our experimental use case. We also introduce a new semantic relatedness evaluation measure that avoids the need for human feedback annotation after the active learning process. The results also show higher diversity and greater overlap with related scientific topics, which we think can better foster transdisciplinary research.
