Combining machine learning and rule-based approaches in Spanish syntactic generation

This thesis describes a Spanish Generation grammar that combines hand-written rules with Machine Learning techniques. The grammar is part of a full-scale, commercial-quality Machine Translation (MT) system developed at Microsoft Research. The first part presents the grammar and the main linguistic strategies it implements. Real-world use of the MT system demands robustness, which places an extra burden on the Generator. This is addressed by adding a Pre-Generation layer that guarantees the integrity of the input to Generation without introducing ad hoc elements into the grammar rules. In the second part, we explore the use of Decision Tree (DT) classifiers to learn automatically one of the operations that take place in the Pre-Generation component, namely lexical selection of the Spanish copula (ser or estar). We show that the contexts for this non-trivial linguistic phenomenon can be inferred from examples with high accuracy.
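To make the second part concrete, the following is a minimal sketch of how a decision tree can learn copula-selection contexts from labelled examples. It is not the thesis's implementation: the ID3-style learner, the feature names (`predicate`, `subject`), and the six hand-made training examples are all assumptions chosen for illustration, and the thesis's actual features, data, and learning toolkit are not given in this abstract.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_feature(rows, labels, features):
    """Return the feature whose split yields the highest information gain."""
    base = entropy(labels)

    def gain(feature):
        remainder = 0.0
        for value in set(row[feature] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder

    return max(features, key=gain)

def build_tree(rows, labels, features):
    """Grow an ID3-style tree; each leaf is a copula label ('ser' or 'estar')."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    feature = best_feature(rows, labels, features)
    node = {"feature": feature, "default": majority, "children": {}}
    remaining = [f for f in features if f != feature]
    for value in set(row[feature] for row in rows):
        pairs = [(row, lab) for row, lab in zip(rows, labels) if row[feature] == value]
        node["children"][value] = build_tree(
            [p[0] for p in pairs], [p[1] for p in pairs], remaining
        )
    return node

def classify(tree, row):
    """Walk the tree; fall back to the node's majority label on unseen values."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree

# Hypothetical toy training set (hand-made, NOT data from the thesis).
# Each row describes the context of a copula to be generated.
examples = [
    ({"predicate": "noun", "subject": "entity"}, "ser"),        # Ella es médica
    ({"predicate": "noun", "subject": "event"}, "ser"),         # La reunión es un éxito
    ({"predicate": "gerund", "subject": "entity"}, "estar"),    # Ella está comiendo
    ({"predicate": "gerund", "subject": "event"}, "estar"),     # La reunión está terminando
    ({"predicate": "locative", "subject": "entity"}, "estar"),  # Ella está en casa
    ({"predicate": "locative", "subject": "event"}, "ser"),     # La reunión es en Madrid
]
rows = [features for features, _ in examples]
labels = [label for _, label in examples]
tree = build_tree(rows, labels, ["predicate", "subject"])

# Pre-Generation-style use: choose the copula lemma for a new context.
print(classify(tree, {"predicate": "locative", "subject": "event"}))
```

On this toy data the learner splits first on the predicate type and consults the subject type only for locative predicates, which mirrors the idea of inferring the selection contexts from examples rather than writing them by hand.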
