Document Ranking for Curated Document Databases Using BERT and Knowledge Graph Embeddings: Introducing GRAB-Rank

Curated Document Databases (CDDs) play an important role in helping researchers find relevant articles in the scientific literature. Considerable recent attention has been given to the use of document ranking algorithms to support the maintenance of CDDs. The typical approach is to represent the update document collection using a form of word embedding and to input this into a ranking model; the resulting document rankings can then be used to decide which documents should be added to the CDD and which should be rejected. The hypothesis considered in this paper is that a better ranking model can be produced if a hybrid embedding is used. To this end, the Knowledge Graph And BERT Ranking (GRAB-Rank) approach is presented. The Online Resource for Recruitment research in Clinical trials (ORRCA) CDD was used as a focus for the work and as a means of evaluating the proposed technique. The GRAB-Rank approach is fully described and evaluated in the context of learning to rank for the purpose of maintaining CDDs. The evaluation indicates that the hypothesis is correct: a hybrid embedding outperforms the individual embeddings used in isolation. The evaluation also indicates that GRAB-Rank outperforms a traditional approach based on BM25 and an n-gram-based SVR document ranking approach.
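The idea of ranking documents from a hybrid embedding can be illustrated with a minimal sketch. The abstract does not say how the two embeddings are combined or which ranking model is used, so everything below is an assumption made for illustration: the hybrid vector is formed by concatenating a normalised text vector with a normalised graph vector, a fixed linear scorer stands in for a learned ranking model, and the vectors are toy values rather than real BERT or knowledge graph embeddings. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def hybrid_embedding(text_vec: Sequence[float],
                     graph_vec: Sequence[float]) -> List[float]:
    """Combine two document embeddings into one vector.

    Assumption: plain concatenation of the normalised parts. The source
    does not specify the combination method.
    """
    return l2_normalize(text_vec) + l2_normalize(graph_vec)


def rank_documents(docs: Dict[str, Tuple[Sequence[float], Sequence[float]]],
                   weights: Sequence[float]) -> List[str]:
    """Return document ids ordered from highest to lowest score.

    Assumption: a fixed linear scorer stands in for a trained
    learning-to-rank model.
    """
    scores = {}
    for doc_id, (text_vec, graph_vec) in docs.items():
        vec = hybrid_embedding(text_vec, graph_vec)
        scores[doc_id] = sum(w * x for w, x in zip(weights, vec))
    return sorted(scores, key=scores.get, reverse=True)


# Toy candidate documents: (text vector, graph vector) per document id.
candidates = {
    "a": ([1.0, 0.0], [0.0, 1.0]),
    "b": ([0.0, 1.0], [1.0, 0.0]),
    "c": ([1.0, 1.0], [0.0, 2.0]),
}
# Hypothetical weights over the 4-dimensional hybrid vector.
weights = [1.0, 0.0, 0.0, 1.0]

ranking = rank_documents(candidates, weights)
print(ranking)
```

In a CDD maintenance setting, the top of such a ranking would be the documents proposed for inclusion, with the cut-off chosen by the curators.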
