Using MathML Parallel Markup Corpora for Semantic Enrichment of Mathematical Expressions

This paper explores the problem of semantic enrichment of mathematical expressions. We formulate this task as the translation of mathematical expressions from presentation markup to content markup. We use MathML, an application of XML, to describe both the structure and the content of mathematical notation. We apply a method based on statistical machine translation to extract translation rules automatically. This approach contrasts with previous research, which tends to rely on manually encoded rules. We also introduce segmentation rules for mathematical expressions. Combining segmentation rules with translation rules strengthens the translation system and achieves significant improvements over a prior rule-based system.

key words: semantic enrichment, MathML markup, statistical machine translation
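
To make the presentation-to-content formulation concrete, the following minimal Python sketch shows one parallel markup pair for the expression x^2 and a toy translator between the two forms. The example pair, the `LEAF_RULES` table, and the `translate` function are illustrative assumptions and are not taken from the paper; in particular, the rules here are written by hand, whereas the approach described above extracts such rules automatically from a parallel corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical parallel pair for x^2 (illustrative, not from the paper's corpus).
PRESENTATION = "<msup><mi>x</mi><mn>2</mn></msup>"       # how it looks
CONTENT = "<apply><power/><ci>x</ci><cn>2</cn></apply>"  # what it means

# Hand-written toy rules: presentation leaf tag -> content leaf tag.
LEAF_RULES = {"mi": "ci", "mn": "cn"}


def translate(node):
    """Translate a tiny subset of Presentation MathML into Content MathML."""
    if node.tag in LEAF_RULES:
        out = ET.Element(LEAF_RULES[node.tag])
        out.text = node.text
        return out
    if node.tag == "msup":
        # Assumption: the superscript denotes a power. It could also be an
        # index or a label, which is the ambiguity enrichment must resolve.
        out = ET.Element("apply")
        ET.SubElement(out, "power")
        for child in node:
            out.append(translate(child))
        return out
    raise ValueError(f"no rule for <{node.tag}>")


def shape(root):
    """Flatten a tree into (tag, text) pairs for easy comparison."""
    return [(el.tag, el.text) for el in root.iter()]


translated = translate(ET.fromstring(PRESENTATION))
reference = ET.fromstring(CONTENT)
print(shape(translated) == shape(reference))
```

The comparison is done on the flattened tree rather than on serialized strings, so it does not depend on how the XML library writes empty elements such as `power`.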
