LIF at TAC MultiLing: Towards a Truly Language Independent Summarizer

This paper presents the LIF system for the TAC’2011 Multilingual pilot track. We followed a language-independent approach to summarization for this task. In particular, we tried to remove the following dependencies on language: sentence segmentation, word segmentation, stop-word lists, and word-level relevance assessment. We applied these modifications to an MMR-based system and observed little degradation on English data. The submitted system had a bug that affected all official results, so this paper reports an updated set of results with the relevant analysis.
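
To make the idea of an MMR-based selector without language-specific resources concrete, the following is a minimal sketch and not the LIF implementation. It assumes character trigrams as the text representation (so no word segmentation or stop-word list is needed), cosine similarity to the centroid of all input units as the relevance score, and a trade-off weight `lam`; the function names `char_ngrams`, `cosine`, and `mmr_select` are hypothetical, and the input units are assumed to be already split by some means.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed representation, no tokenizer)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_select(units, k=2, lam=0.7, n=3):
    """Greedy MMR: pick k units, trading relevance against redundancy."""
    vectors = [char_ngrams(u, n) for u in units]
    centroid = sum(vectors, Counter())  # stands in for the "query" (assumption)
    selected = []
    candidates = list(range(len(units)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vectors[i], centroid)
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [units[i] for i in selected]


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat sat on the mat",
        "stock markets fell sharply today",
    ]
    for unit in mmr_select(docs, k=2, lam=0.5):
        print(unit)
```

Lower values of `lam` penalize redundancy more strongly, which is why an exact duplicate of an already selected unit is passed over in favor of less similar material.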
