Cross-lingual C*ST*RD: English access to Hindi information

We present C*ST*RD, a cross-language information delivery system that supports cross-language information retrieval, information space visualization and navigation, machine translation, and text summarization of single documents and clusters of documents. C*ST*RD was assembled and trained within one month, in the context of DARPA's Surprise Language Exercise, which selected as its source a heretofore unstudied language, Hindi. Given the brief time, we could not create deep Hindi capabilities for all the modules; instead, we experimented with combining shallow Hindi capabilities, or even English-only modules, into one integrated system. Various possible configurations, with different tradeoffs in processing speed and ease of use, enable the rapid deployment of C*ST*RD to new languages under various conditions.
