An Unsupervised Approach to Biography Production Using Wikipedia

We describe an unsupervised approach to multi-document sentence-extraction based summarization for the task of producing biographies. We utilize Wikipedia to automatically construct a corpus of biographical sentences and TDT4 to construct a corpus of non-biographical sentences. We build a biographical-sentence classifier from these corpora and an SVM regression model for sentence ordering from the Wikipedia corpus. We evaluate our work on the DUC2004 evaluation data and with human judges. Overall, our system significantly outperforms all systems that participated in DUC2004, according to the ROUGE-L metric, and is preferred by human subjects.

[1]  Ralph Grishman,et al.  NYU's English ACE 2005 System Description , 2005 .

[2]  Eduard H. Hovy,et al.  The Automated Acquisition of Topic Signatures for Text Summarization , 2000, COLING.

[3]  Sasha Blair-Goldensohn,et al.  Answering Definitional Questions: A Hybrid Approach , 2004, New Directions in Question Answering.

[4]  Marie-Francine Moens,et al.  Multidocument Question Answering Text Summarization Using Topic Signatures , 2005, J. Digit. Inf. Manag..

[5]  Eduard H. Hovy,et al.  Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics , 2003, NAACL.

[6]  Jacob Cohen,et al.  Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. , 1968 .

[7]  Jinxi Xu,et al.  A Hybrid Approach to Answering Biographical Questions , 2004, New Directions in Question Answering.

[8]  Mark A. Przybocki,et al.  The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation , 2004, LREC.

[9]  Sujian Li,et al.  Multi-document Summarization Using Support Vector Regression , 2007 .

[10]  Liang Zhou,et al.  Multi-Document Biography Summarization , 2005, EMNLP.

[11]  Ian H. Witten,et al.  Weka: Practical machine learning tools and techniques with Java implementations , 1999 .

[12]  Regina Barzilay,et al.  Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization , 2004, NAACL.

[13]  Regina Barzilay,et al.  Sentence Ordering in Multidocument Summarization , 2001, HLT.

[14]  Ani Nenkova,et al.  References to Named Entities: a Corpus Study , 2003, HLT-NAACL.

[15]  J. Fleiss Measuring nominal scale agreement among many raters. , 1971 .

[16]  Elena Filatova,et al.  Tell Me What You Do and I'll Tell You What You Are: Learning Occupation-Related Activities for Biographies , 2005, HLT/EMNLP.

[17]  Kathleen McKeown,et al.  Statistical Acquisition of Content Selection Rules for Natural Language Generation , 2003, EMNLP.

[18]  David Evans,et al.  Columbia University at DUC 2004 , 2004 .