Discovering Mis-Categorized Entities

Entity categorization – the process of grouping entities into categories for some specific purpose – is an important problem with a great many applications, such as Google Scholar and Amazon products. Unfortunately, in practice, many entities are mis-categorized. In this paper, we study the problem of discovering mis-categorized entities from a given group of entities. This problem is inherently hard: all entities within the same group have been "well" categorized by state-of-the-art solutions. Apparently, it is nontrivial to differentiate them. We propose a novel rule-based framework to solve this problem. It first uses positive rules to compute disjoint partitions of entities, where the partition with the largest size is taken as the correctly categorized partition, namely the pivot partition. It then uses negative rules to identify mis-categorized entities in other partitions that are dissimilar to the entities in the pivot partition. We describe optimizations on applying these rules, and discuss how to generate positive/negative rules. Extensive experimental results on two real-world datasets show the effectiveness of our solution.

[1]  Jeffrey Xu Yu,et al.  Entity Matching: How Similar Is Similar , 2011, Proc. VLDB Endow..

[2]  Yang Li,et al.  Mining evidences for named entity disambiguation , 2013, KDD.

[3]  Tim Kraska,et al.  Leveraging transitive relations for crowdsourced joins , 2013, SIGMOD '13.

[4]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[5]  Paolo Papotti,et al.  Synthesizing Entity Matching Rules by Examples , 2017, Proc. VLDB Endow..

[6]  Guoliang Li,et al.  Cleaning Your Wrong Google Scholar Entries , 2018, 2018 IEEE 34th International Conference on Data Engineering (ICDE).

[7]  Anton van den Hengel,et al.  Image-Based Recommendations on Styles and Substitutes , 2015, SIGIR.

[8]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[9]  Lise Getoor,et al.  Collective entity resolution in relational data , 2007, TKDD.

[10]  Wen-Syan Li,et al.  String Similarity Joins: An Experimental Evaluation , 2014, Proc. VLDB Endow..

[11]  Razvan C. Bunescu,et al.  Using Encyclopedic Knowledge for Named entity Disambiguation , 2006, EACL.

[12]  Ramesh Nallapati,et al.  Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora , 2009, EMNLP.

[13]  Tim Kraska,et al.  CrowdER: Crowdsourcing Entity Resolution , 2012, Proc. VLDB Endow..

[14]  E. Forgy,et al.  Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[15]  Samhaa R. El-Beltagy,et al.  A Survey of Ontology Learning Approaches , 2011 .

[16]  Rajarshi Das,et al.  Gaussian LDA for Topic Models with Word Embeddings , 2015, ACL.

[17]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[18]  Anil K. Jain Data clustering: 50 years beyond K-means , 2008, Pattern Recognit. Lett..

[19]  Jeffrey F. Naughton,et al.  Corleone: hands-off crowdsourcing for entity matching , 2014, SIGMOD Conference.

[20]  Pedro M. Domingos,et al.  Entity Resolution with Markov Logic , 2006, Sixth International Conference on Data Mining (ICDM'06).

[21]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.