A pipeline for extracting and deduplicating domain-specific knowledge bases

Building a knowledge base (KB) that describes domain-specific entities is an important problem in industry; examples include KBs built over companies (e.g., Dun & Bradstreet), skills (LinkedIn, CareerBuilder), and people (inome). The task involves several engineering challenges, including devising effective procedures for data extraction, aggregation, and deduplication. Data extraction involves processing multiple information sources to extract domain-specific data instances. The extracted instances must then be aggregated and deduplicated; that is, instances referring to the same underlying entity must be identified and merged. This paper describes a pipeline developed at CareerBuilder LLC for building a KB describing employers. The pipeline first extracts entities from both global, publicly available data sources (Wikipedia and Freebase) and a proprietary source (Infogroup), and then deduplicates the instances to yield an employer-specific KB. We conduct a range of pilot experiments over three independently labeled datasets sampled from the extracted KB, and comment on some lessons learned.
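
To make the deduplication step concrete, the following is a minimal Python sketch of identifying and merging instances that refer to the same employer. It is an illustration only and not the method used in the pipeline: the suffix list, the first-token blocking key, the `difflib` similarity measure, the 0.9 threshold, and the function names `normalize` and `deduplicate` are all assumptions introduced here.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names (assumption).
SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "corporation", "incorporated"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def deduplicate(names, threshold=0.9):
    """Group names whose normalized forms are similar; returns a list of clusters."""
    # Blocking: only compare names that share the first normalized token,
    # which avoids comparing every pair of instances.
    blocks = defaultdict(list)
    for i, name in enumerate(names):
        key = normalize(name).split(" ")[0]
        blocks[key].append(i)

    # Union-find structure used to merge matching instances transitively.
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for members in blocks.values():
        for pos, a in enumerate(members):
            for b in members[pos + 1:]:
                sim = SequenceMatcher(
                    None, normalize(names[a]), normalize(names[b])
                ).ratio()
                if sim >= threshold:
                    parent[find(b)] = find(a)

    clusters = defaultdict(list)
    for i, name in enumerate(names):
        clusters[find(i)].append(name)
    return list(clusters.values())


if __name__ == "__main__":
    sample = ["CareerBuilder LLC", "Careerbuilder, Inc.", "Infogroup", "Dun & Bradstreet"]
    for cluster in deduplicate(sample):
        print(cluster)
```

In this sketch the two CareerBuilder variants normalize to the same string and are merged, while the other names stay in their own clusters. A production system would likely replace the fixed threshold with a learned matching model and use a more robust blocking scheme.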
