Multi-Field Information Extraction and Cross-Document Fusion

In this paper, we examine the task of extracting a set of biographic facts about target individuals from a collection of Web pages. We automatically annotate training text with positive and negative examples of fact extractions and train Rote, Naive Bayes, and Conditional Random Field models to extract facts from individual Web pages. We then propose and evaluate methods for fusing the extracted information across documents to return a consensus answer. A novel cross-field bootstrapping method leverages interdependencies among the fields to yield improved performance.
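
To make the idea of cross-document fusion concrete, the following is a minimal sketch of one simple way a consensus answer could be chosen: confidence-weighted voting over per-document candidates. It is an illustration under stated assumptions, not the fusion method evaluated in the paper. The function name fuse_by_voting, the (value, confidence) input format, and the sample values are all hypothetical.

```python
from collections import defaultdict


def fuse_by_voting(extractions):
    """Pick a consensus value by summing confidences per candidate.

    `extractions` is an iterable of (value, confidence) pairs, one pair
    per extraction from an individual document (hypothetical format).
    Returns None when there are no candidates.
    """
    scores = defaultdict(float)
    for value, confidence in extractions:
        # Lowercase and collapse whitespace so trivially different
        # surface forms of the same answer pool their votes.
        key = " ".join(value.lower().split())
        scores[key] += confidence
    if not scores:
        return None
    # The candidate with the largest total confidence wins.
    return max(scores, key=scores.get)


# Made-up candidates for a single field (e.g. a birth year) gathered
# from three different pages about the same person.
candidates = [("1879", 0.9), ("1879", 0.6), ("1955", 0.7)]
consensus = fuse_by_voting(candidates)
print(consensus)
```

Summing confidences rather than counting raw votes lets several moderately confident pages outweigh one confident outlier; other fusion rules (for example, taking the single most confident extraction) would slot into the same interface.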
