Canonicalizing Knowledge Bases for Recruitment Domain

The online recruitment industry holds a large amount of user-generated content in the form of job postings, resumes, etc. This content finds its way into knowledge bases (KBs), causing duplicate and non-standard representations of entities (such as company names, institute names, designations, and skills). These non-standard entity representations impact various applications such as search, recommendations, and information retrieval. Therefore, KB canonicalization, i.e., mapping multiple references to the same entity into unique clusters, is imperative for online recruitment platforms. Prior research suggests various approaches that use enriched semantic context or external context (from sources like Freebase) to perform KB canonicalization. In fields where such external sources of context do not exist, the problem remains challenging. To address these challenges, we propose a novel deep Siamese architecture with character-based attention and word embeddings that (a) estimates pairwise similarity between all entity mentions and (b) then uses these similarity scores to create canonical clusters, each representing a unique entity in the KB. Our experiments on a recruitment-domain dataset comprising 62,288 unique entities of various types, such as companies, institutes, skills, and designations, demonstrate the effectiveness of our approach. We also provide insights on different network architectures, each of which encapsulates a different set of variations while performing canonicalization.
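
The two-step pipeline described above (pairwise similarity, then clustering) can be illustrated with a minimal sketch. This is not the proposed model: the Siamese network's learned score is replaced here by a simple character-trigram Jaccard overlap, and the clustering step is assumed to be threshold-based linking with union-find. The function names, the threshold value, and the example mentions are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def similarity(a, b):
    """Jaccard overlap of character trigrams.

    Assumption: a stand-in for the learned pairwise score; a trained
    Siamese model would be called here instead.
    """
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


def canonical_clusters(mentions, threshold=0.5):
    """Group mentions whose pairwise similarity reaches the threshold.

    Assumption: clusters are the connected components of the
    'similar enough' graph, tracked with a union-find structure.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step (a): score every pair of mentions.
    for i, j in combinations(range(len(mentions)), 2):
        if similarity(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)  # Step (b): merge their clusters.

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())


if __name__ == "__main__":
    # Hypothetical mentions, not taken from the paper's dataset.
    for cluster in canonical_clusters(
        ["Infosys Ltd", "Infosys Ltd.", "Google", "Tata Motors"]
    ):
        print(cluster)
```

Scoring every pair is quadratic in the number of mentions, so a sketch like this only suits small inputs; the surface-level trigram score also cannot capture variations such as abbreviations, which is the kind of gap a learned similarity model is meant to close.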
