Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams

The name entity disambiguation task aims to partition the records of multiple real-life persons so that each partition contains the records of a single person. Most existing solutions operate in batch mode, where all records to be disambiguated are available to the algorithm from the start. More realistic settings, however, require that name disambiguation be performed online and that the method be able to identify records of new ambiguous entities that have no preexisting records. In this work, we propose a Bayesian non-exhaustive classification framework for the online name disambiguation task. The proposed method uses a Dirichlet process prior with a Normal × Normal × Inverse Wishart data model, which enables identification of new ambiguous entities that have no records in the training data. For online classification, we use a one-sweep Gibbs sampler, which is both efficient and effective. As a case study, we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method outperforms existing methods on the online name disambiguation task.
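The idea of assigning each arriving record either to a known entity or to a newly created one can be illustrated with a minimal sketch. This is not the method of the paper: it assumes a much simpler data model (isotropic Gaussian clusters with a fixed variance `sigma2` and a Normal prior on each cluster mean) in place of the Normal × Normal × Inverse Wishart model, and it picks the highest-scoring cluster deterministically instead of drawing an assignment as a Gibbs sampler would. The class name `OnlineCRPClassifier` and all parameter names and values are hypothetical.

```python
import math


def log_normal_iso(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)


class OnlineCRPClassifier:
    """Toy online classifier with a Chinese-restaurant-process prior.

    Assumption (not from the paper): each cluster is N(mu_k, sigma2 * I)
    with prior mu_k ~ N(mu0, tau2 * I); sigma2 and tau2 are fixed.
    """

    def __init__(self, alpha=1.0, sigma2=1.0, tau2=25.0, mu0=(0.0, 0.0)):
        self.alpha = alpha      # concentration: larger favors new clusters
        self.sigma2 = sigma2    # within-cluster variance
        self.tau2 = tau2        # prior variance of cluster means
        self.mu0 = list(mu0)    # prior mean of cluster means
        self.counts = []        # number of records per cluster
        self.sums = []          # per-dimension sum of records per cluster

    def _log_predictive(self, x, n, s):
        # Posterior of the cluster mean given n records with sum s,
        # then the predictive density of x under that posterior.
        post_var = 1.0 / (1.0 / self.tau2 + n / self.sigma2)
        post_mean = [post_var * (m0 / self.tau2 + sk / self.sigma2)
                     for m0, sk in zip(self.mu0, s)]
        return log_normal_iso(x, post_mean, self.sigma2 + post_var)

    def assign(self, x):
        """Score every existing cluster plus a new one; update and return the winner."""
        total = sum(self.counts)
        scores = [math.log(n / (total + self.alpha)) + self._log_predictive(x, n, s)
                  for n, s in zip(self.counts, self.sums)]
        # Last entry: the record starts a previously unseen cluster.
        scores.append(math.log(self.alpha / (total + self.alpha))
                      + self._log_predictive(x, 0, [0.0] * len(x)))
        k = max(range(len(scores)), key=scores.__getitem__)
        if k == len(self.counts):
            self.counts.append(0)
            self.sums.append([0.0] * len(x))
        self.counts[k] += 1
        self.sums[k] = [sk + xi for sk, xi in zip(self.sums[k], x)]
        return k


stream = [(0.0, 0.0), (0.2, -0.1), (10.0, 10.0), (9.8, 10.1), (0.1, 0.1)]
clf = OnlineCRPClassifier()
labels = [clf.assign(x) for x in stream]
print(labels)
```

In this toy stream the third record lies far from the only existing cluster, so the new-cluster option scores highest and a second cluster is created. Replacing the `max` step with a draw proportional to the exponentiated scores would give a sampling-based variant closer in spirit to a single Gibbs sweep.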
