Name Discrimination and Email Clustering using Unsupervised Clustering and Labeling of Similar Contexts

In this paper, we apply an unsupervised word sense discrimination technique based on clustering similar contexts (Purandare and Pedersen, 2004) to the problems of name discrimination and email clustering. Names of people, places, and organizations are not always unique, which can create a problem when we refer to or seek out information about such entities. When this occurs in written text, we show that we can cluster ambiguous names into unique groups by identifying which contexts are similar to each other. Pedersen, Purandare, and Kulkarni (2005) previously showed that this approach can successfully discriminate names with two-way ambiguity. Here we show that it can be extended to multiway distinctions as well. We adapt the cluster labeling technique introduced by Kulkarni (2005) to the multiway distinctions of name discrimination. Along similar lines, we also observe that email messages can be treated as contexts, and that by clustering them we are able to group them based on their underlying content rather than the occurrence of specific strings.
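The idea of grouping occurrences of an ambiguous name by the similarity of their contexts can be illustrated with a minimal sketch. It is not the method of the paper: it assumes plain bag-of-words vectors, cosine similarity, and average-link agglomerative clustering with the number of clusters `k` given in advance. The function names (`context_vector`, `cosine`, `cluster_contexts`) and the toy contexts are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def context_vector(text, target):
    """Bag-of-words counts for one context, ignoring the ambiguous name itself."""
    skip = set(target.lower().split())
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in skip)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_contexts(contexts, target, k):
    """Average-link agglomerative clustering of contexts into k groups.

    Returns a list of clusters, each a list of indices into `contexts`.
    Assumption for this sketch: k is supplied by the caller.
    """
    k = max(k, 1)
    vectors = [context_vector(c, target) for c in contexts]
    clusters = [[i] for i in range(len(contexts))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find the most similar pair of clusters and merge it.
        _, x, y = max(
            (avg_sim(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy data: two different people who share one name (illustrative only).
contexts = [
    "George Miller directed the film Mad Max",
    "George Miller studied memory and psychology",
    "the film director George Miller released a new film",
    "psychology professor George Miller wrote about memory",
]
groups = cluster_contexts(contexts, "George Miller", k=2)
```

In this toy example the two film-related contexts share words with each other and the two psychology-related contexts share words with each other, so they end up in separate groups. The same functions could be applied to email bodies treated as contexts, with `target` set to an empty string.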

[1]  J. M. Cohen, et al. Mexico City: México, 1965.

[2]  Patrick Pantel, et al. Automatically Labeling Semantic Classes, 2004, NAACL.

[3]  H. Kuhn. The Hungarian method for the assignment problem, 1955.

[4]  George Karypis, et al. Evaluation of hierarchical clustering algorithms for document datasets, 2002, CIKM '02.

[5]  Julie Weeds, et al. Finding Predominant Word Senses in Untagged Text, 2004, ACL.

[6]  G. Miller, et al. Contextual correlates of semantic similarity, 1991.

[7]  Ted Pedersen, et al. Distinguishing Word Senses in Untagged Text, 1997, EMNLP.

[8]  David Yarowsky, et al. Unsupervised Personal Name Disambiguation, 2003, CoNLL.

[9]  Ted Pedersen, et al. Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces, 2004, CoNLL.

[10]  Ken Lang, et al. NewsWeeder: Learning to Filter Netnews, 1995, ICML.

[11]  T. Caliński, et al. A dendrite method for cluster analysis, 1974.

[12]  John A. Hartigan, et al. Clustering Algorithms, 1975.

[13]  Philip Chan, et al. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms, 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[14]  James Munkres. On the assignment and transportation problems (abstract), 1957.

[15]  Breck Baldwin, et al. Entity-Based Cross-Document Coreferencing Using the Vector Space Model, 1998, COLING.

[16]  Andrew McCallum, et al. Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora, 2005.

[17]  Anagha Kulkarni. Unsupervised Discrimination and Labeling of Ambiguous Names, 2005, ACL.

[18]  Tessa A. Lau, et al. Automated email activity management: an unsupervised learning approach, 2005, IUI.

[19]  Hinrich Schütze, et al. Automatic Word Sense Discrimination, 1998, Comput. Linguistics.

[20]  Robert Tibshirani, et al. Estimating the number of clusters in a data set via the gap statistic, 2000.

[21]  Ted Pedersen, et al. Name Discrimination by Clustering Similar Contexts, 2005, CICLing.

[22]  James Allan, et al. Cross-Document Coreference on a Large Scale Corpus, 2004, NAACL.