A tale of two paradigms: disambiguating extracted entities with applications to a digital library and the web

With the increasing wealth of information on the Web, the need for information integration is ubiquitous, because the same real-world entity may appear in a variety of forms extracted from different sources. This dissertation proposes supervised and unsupervised algorithms, integrated in a scalable framework, to solve the entity resolution problem, which lies at the heart of the information integration process. It focuses on two incarnations of the entity resolution problem that arise in data mining and natural language processing.

First, name disambiguation arises when one seeks the list of publications of an author in a digital library, where the author may have used different name variations and multiple other authors may share the same name. We present an efficient integrative framework that disambiguates the author metadata extracted from paper headers in a divide-and-conquer fashion: a blocking method retrieves candidate classes of authors with similar names, and a density-based clustering method, DBSCAN, clusters the records by author. The distance metric between papers used for clustering is calculated by LASVM, an online Support Vector Machine algorithm with active selection. We prove that by recasting transitivity as density connectivity in DBSCAN, transitivity is guaranteed for core points. The method achieves high accuracy on a manually labeled dataset and readily disambiguates about a million author metadata records in CiteSeer, which paves the way for fielded search by author name in CiteSeerX.

Second, as a key step towards document understanding in natural language processing, we investigate the problem of cross document coreference (CDC), which aims to determine the true referent of a named entity across document boundaries. This dissertation presents a novel cross document coreference approach that leverages entity profiles, which are constructed by information extraction tools and reconciled using a within-document coreference module. We propose to match the profiles using a learned ensemble distance function composed of a suite of similarity specialists. We develop a kernelized soft relational clustering algorithm that uses the learned distance function to partition the entities into fuzzy sets of identities. Evaluation on a large benchmark collection shows that the proposed methods achieve competitive coreference results. We further discuss the implementation details of the CDC and web person search system.

This dissertation also surveys the literature on author name disambiguation in citations and paper headers, citation matching, and cross document coreference. Additionally, we explore the social networks of the disambiguated authors, performing a comprehensive study of network-level and community-level characteristics and proposing a stochastic model to predict collaborations between individuals.
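The divide-and-conquer name disambiguation pipeline described above (blocking, then DBSCAN within each block) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the record layout, the blocking key (last name plus first initial), the coauthor-overlap distance, and the `eps` and `min_pts` values. In particular, the dissertation learns the pairwise distance with LASVM; the fixed Jaccard formula below is only a stand-in for it.

```python
from collections import defaultdict


def block_key(name):
    """Hypothetical blocking key: (last name, first initial), lowercased."""
    parts = name.lower().replace(".", " ").split()
    return (parts[-1], parts[0][0])


def distance(a, b):
    """Stand-in for the learned distance: 1 - Jaccard overlap of coauthors."""
    union = a["coauthors"] | b["coauthors"]
    if not union:
        return 1.0
    return 1.0 - len(a["coauthors"] & b["coauthors"]) / len(union)


def dbscan(records, eps, min_pts):
    """Plain DBSCAN over pairwise distances; label -1 marks noise."""
    n = len(records)
    # Each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n)
         if j == i or distance(records[i], records[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # only core points expand
                frontier.extend(neighbors[j])
    return labels


def disambiguate(records, eps=0.5, min_pts=2):
    """Block records by name key, then cluster each block independently."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec["name"])].append(rec)
    result = {}
    for key, block in blocks.items():
        for rec, label in zip(block, dbscan(block, eps, min_pts)):
            result[rec["id"]] = (key, label)
    return result


# Made-up records: two distinct "J. Smith" authors and one unrelated name.
records = [
    {"id": 1, "name": "J. Smith", "coauthors": {"adams", "brown"}},
    {"id": 2, "name": "John Smith", "coauthors": {"adams", "brown", "clark"}},
    {"id": 3, "name": "Jane Smith", "coauthors": {"davis", "evans"}},
    {"id": 4, "name": "J Smith", "coauthors": {"davis", "evans"}},
    {"id": 5, "name": "K. Jones", "coauthors": {"adams"}},
]
for rec_id, (key, label) in sorted(disambiguate(records).items()):
    print(rec_id, key, label)
```

Blocking keeps the quadratic distance computation confined to records with similar names, which is what lets the approach scale. Within a block, records that are density-connected through core points end up with the same label, so cluster membership does not depend on every pair being directly close.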
