Macau: Large-scale skill sense disambiguation in the online recruitment domain

Named entity sense disambiguation is a problem with important natural language processing applications. In the online recruitment industry, normalization and recognition of occupational skills play a key role in linking the right candidate with the right job, and disambiguating multi-sense skills improves this normalization and recognition process. In this paper we discuss an automatic large-scale system that identifies and disambiguates multi-sense skills. It has four components: (1) Feature selection: using word embeddings to represent skills and their contexts as vectors; (2) Clustering: applying Markov Chain Monte Carlo (MCMC) methods to aggregate the vectors into clusters, each representing one sense; (3) Scaling: parallelizing the processing of text blobs so the system runs at large scale; (4) Pruning: cleaning clusters by analyzing intra-cluster cosine similarities. In experiments on sample datasets, the MCMC-based clustering algorithm outperforms other clustering algorithms on the disambiguation problem. In data-driven in-house evaluations, our disambiguation system achieves 84% precision.
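
The pruning step (4) can be illustrated with a minimal sketch. The abstract does not specify how intra-cluster cosine similarities are turned into a pruning decision, so the rule below (drop a vector whose mean cosine similarity to the other members of its cluster falls below a threshold) is an assumption, as are the function names, the threshold value, and the toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def mean_intra_similarity(index, cluster):
    """Mean cosine similarity of cluster[index] to every other member."""
    others = [v for j, v in enumerate(cluster) if j != index]
    if not others:
        return 1.0  # a singleton cluster is trivially coherent
    return sum(cosine(cluster[index], v) for v in others) / len(others)


def prune_cluster(cluster, threshold=0.5):
    """Keep members whose mean intra-cluster similarity meets the threshold.

    The threshold of 0.5 is a hypothetical value chosen for this example.
    """
    return [
        v for i, v in enumerate(cluster)
        if mean_intra_similarity(i, cluster) >= threshold
    ]


# Toy context vectors for one candidate sense cluster (made-up numbers).
cluster = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0],  # points in a different direction from the rest
]
pruned = prune_cluster(cluster, threshold=0.5)
```

In this toy cluster the first three vectors point in nearly the same direction and are kept, while the fourth is nearly orthogonal to them and is removed. In a real system the vectors would come from the word-embedding step and the threshold would need to be tuned on evaluation data.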
