Cross-language headline generation for Hindi

This paper presents new approaches to headline generation for English newspaper texts, with an eye toward producing document surrogates for document selection in cross-language information retrieval. Document selection is difficult in this setting because the user must judge relevance from (often poor) translations of the retrieved documents. To facilitate the decision-making process, we need translations that can be assessed rapidly and accurately; our approach is to provide an English headline for the non-English document. We describe two approaches to headline generation and their application to the recent DARPA TIDES-2003 Surprise Language Exercise for Hindi. For comparison, we also implemented an alternative method for surrogate generation: a system that produces topic lists for Hindi articles. We present the results of a series of experiments comparing these approaches. We demonstrate in both automatic and human evaluations that our linguistically motivated approach outperforms two other surrogate-generation methods: a statistical system and a topic discovery system.
