Structural Regularities in Text-based Entity Vector Spaces

Entity retrieval is the task of finding entities such as people or products in response to a query, based solely on the textual documents they are associated with. Recent semantic entity retrieval algorithms represent queries and entities in finite-dimensional vector spaces, where both are constructed from text sequences. We investigate entity vector spaces and the degree to which they capture structural regularities. Such vector spaces are constructed in an unsupervised manner, without explicit information about structural aspects. For concreteness, we study this for a specific type of entity: experts in the context of expert finding. We examine how well clusterings of experts correspond to committees in organizations, how well expert representations encode the co-author graph, and the degree to which they encode academic rank. We compare latent, continuous representations created using methods based on distributional semantics (LSI), topic models (LDA), and neural networks (word2vec, doc2vec, SERT). Vector spaces created using neural methods, such as doc2vec and SERT, systematically perform better at clustering than LSI, LDA, and word2vec. When it comes to encoding entity relations, SERT performs best.
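
To make the clustering comparison concrete, the following is a minimal sketch, not the authors' experimental code. It assumes each expert is already represented by a low-dimensional vector and that a committee label is known per expert. The toy vectors, the labels, the helper names (`kmeans`, `purity`), and the choice of purity as the agreement measure are all illustrative assumptions; the abstract does not state which clustering algorithm or metric is used.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def kmeans(vectors, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    # Start from the first vector, then repeatedly add the vector that is
    # farthest from the centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(euclidean(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: euclidean(v, centroids[j]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members (if it has any).
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment


def purity(assignment, labels):
    """Fraction of items that carry the majority label of their cluster."""
    total = 0
    for cluster in set(assignment):
        members = [l for a, l in zip(assignment, labels) if a == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)


# Hypothetical 2-d expert vectors and committee memberships (toy data).
expert_vectors = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],
]
committees = ["A", "A", "A", "B", "B", "B"]

clusters = kmeans(expert_vectors, k=2)
score = purity(clusters, committees)
print(clusters, score)
```

A higher score means the unsupervised clusters line up more closely with the known groups. Purity is used here only because it is easy to read; a chance-corrected measure would be a more careful choice when the number of clusters varies.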
