Bootstrapping Lexical Choice via Multiple-Sequence Alignment

An important component of any generation system is the mapping dictionary, a lexicon of elementary semantic expressions and corresponding natural language realizations. Typically, labor-intensive knowledge-based methods are used to construct the dictionary. We instead propose to acquire it automatically via a novel multiple-pass algorithm employing multiple-sequence alignment, a technique commonly used in bioinformatics. Crucially, our method leverages latent information contained in multi-parallel corpora, that is, datasets that supply several verbalizations of the corresponding semantics rather than just one. We used our techniques to generate natural language versions of computer-generated mathematical proofs, with good results on both a per-component and overall-output basis. For example, in evaluations involving a dozen human judges, our system produced output whose readability and faithfulness to the semantic input rivaled those of a traditional generation system.
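To give a rough sense of the alignment idea mentioned above, the following sketch aligns two verbalizations of the same semantics token by token using standard Needleman-Wunsch dynamic programming. It is only a pairwise simplification and not the multiple-pass, multiple-sequence procedure the abstract refers to; the function name `align`, the scoring values, and the example sentences are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

GAP = None  # marks a position where one sequence has no token


def align(
    a: List[str],
    b: List[str],
    match: int = 1,      # assumed score for identical tokens
    mismatch: int = -1,  # assumed score for differing tokens
    gap: int = -1,       # assumed score for skipping a token
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Globally align two token lists (Needleman-Wunsch)."""
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair the two tokens
                score[i - 1][j] + gap,      # token of a against a gap
                score[i][j - 1] + gap,      # token of b against a gap
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], GAP))
            i -= 1
        else:
            pairs.append((GAP, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Two hypothetical verbalizations of the same proof step.
first = "assume a equals zero".split()
second = "suppose a equals zero".split()
alignment = align(first, second)

# Columns where the tokens differ are candidate alternative realizations.
differing = [(x, y) for x, y in alignment if x != y]
```

In this toy setting, the columns where the aligned tokens disagree are the places one might look for alternative wordings of the same semantic element, while the columns that agree suggest shared structure. Aligning more than two verbalizations at once, as multiple-sequence alignment does, would require a different procedure than this pairwise one.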
