Network Analysis Social Extraction Keyword names CRF Contact and Information Extraction Email Extraction Person

We present an end-to-end system that extracts a user’s social network and its members’ contact information given the user’s email inbox. The system identifies unique people in email, finds their Web presence, and automatically fills the fields of a contact address book using conditional random fields—a type of probabilistic model well-suited for such information extraction tasks. By recursively calling itself on new people discovered on the Web, the system builds a social network with multiple degrees of separation from the user. Additionally, a set of expertise-describing keywords are extracted and associated with each person. We outline the collection of statistical and learning components that enable this system, and present experimental results on the real email of two users; we also present results with a simple method of learning transfer, and discuss the capabilities of the system for addressbook population, expert-finding, and social network analysis.

[1]  Andrew McCallum,et al.  Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data , 2004, J. Mach. Learn. Res..

[2]  Michael Riley,et al.  Speech Recognition by Composition of Weighted Finite Automata , 1996, ArXiv.

[3]  Vinayak R. Borkar,et al.  Automatically Extracting Structure from Free Text Addresses , 2000, IEEE Data Eng. Bull..

[4]  Ben Taskar,et al.  Probabilistic Classification and Clustering in Relational Data , 2001, IJCAI.

[5]  Oren Etzioni,et al.  Dynamic Reference Sifting: A Case Study in the Homepage Domain , 1997, Comput. Networks.

[6]  David Hawking,et al.  Query-independent evidence in home page finding , 2003, TOIS.

[7]  Bernardo A. Huberman,et al.  Email as spectroscopy: automated discovery of community structure within organizations , 2003 .

[8]  Doug Downey,et al.  Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison , 2004, AAAI.

[9]  Edward A. Fox,et al.  Machine Learning Approach for Homepage Finding Task , 2002, TREC.

[10]  Edsger W. Dijkstra,et al.  A note on two problems in connexion with graphs , 1959, Numerische Mathematik.

[11]  J. Neyman,et al.  INADMISSIBILITY OF THE USUAL ESTIMATOR FOR THE MEAN OF A MULTIVARIATE NORMAL DISTRIBUTION , 2005 .

[12]  David A. Cohn,et al.  The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity , 2000, NIPS.

[13]  Rich Caruana,et al.  Multitask Learning , 1997, Machine-mediated learning.

[14]  Paul A. Viola,et al.  Interactive Information Extraction with Constrained Conditional Random Fields , 2004, AAAI.

[15]  Chris H. Q. Ding,et al.  A min-max cut algorithm for graph partitioning and data clustering , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[16]  Sebastian Thrun,et al.  Explanation-based neural network learning a lifelong learning approach , 1995 .

[17]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[18]  U. Brandes A faster algorithm for betweenness centrality , 2001 .

[19]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[20]  Tom M. Mitchell,et al.  Improving Text Classification by Shrinkage in a Hierarchy of Classes , 1998, ICML.

[21]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[22]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.