Building a Discourse Parser for Informal Mathematical Discourse in the Context of a Controlled Natural Language

The lack of dedicated data sets makes discourse parsing difficult for Informal Mathematical Discourse (IMD). In this paper, we propose a data-driven approach to identifying arguments and connectives in an IMD structure within the context of a Controlled Natural Language (CNL). Our approach performs low-level discourse parsing under the Penn Discourse TreeBank (PDTB) guidelines. Three classifiers have been trained: one that identifies Arg2, another that locates the relative position of Arg1, and a third that identifies the arguments (Arg1 and Arg2) of each connective. These classifiers are Support Vector Machines (SVMs) trained on our own Mathematical TreeBank. Finally, our approach defines an end-to-end discourse parser for IMD, whose results will be used to classify informal deductive proofs through low-level discourse processing of IMD.
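The three-stage flow described above can be illustrated with a minimal Python sketch. It is not the parser from the paper: the trained SVMs are replaced by simple hand-written heuristics so that the example runs with the standard library alone, and the connective list, the function names, and the "SS"/"PS" labels (Arg1 in the same sentence or in the previous sentence) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical connective list; a real system would derive it from the treebank.
CONNECTIVES = {"therefore", "hence", "thus", "because"}


@dataclass
class Relation:
    connective: str
    arg1_position: str  # "SS" = same sentence, "PS" = previous sentence
    arg1: str
    arg2: str


def tokenize(sentence: str) -> List[str]:
    """Whitespace tokenizer that strips basic trailing punctuation."""
    tokens = (t.strip(".,;:") for t in sentence.split())
    return [t for t in tokens if t]


def find_connective(tokens: List[str]) -> Optional[int]:
    """Return the index of the first known connective, or None."""
    for i, tok in enumerate(tokens):
        if tok.lower() in CONNECTIVES:
            return i
    return None


def identify_arg2(tokens: List[str], conn_idx: int) -> List[str]:
    """Stage 1 stand-in: take the text following the connective as Arg2."""
    return tokens[conn_idx + 1:]


def locate_arg1(conn_idx: int) -> str:
    """Stage 2 stand-in: a sentence-initial connective points to the previous sentence."""
    return "PS" if conn_idx == 0 else "SS"


def extract_arguments(prev_tokens: List[str], tokens: List[str],
                      conn_idx: int, position: str) -> Tuple[str, str]:
    """Stage 3 stand-in: return the final (Arg1, Arg2) spans as strings."""
    arg1 = prev_tokens if position == "PS" else tokens[:conn_idx]
    arg2 = identify_arg2(tokens, conn_idx)
    return " ".join(arg1), " ".join(arg2)


def parse(sentences: List[str]) -> List[Relation]:
    """Run the three stages over a list of sentences, in order."""
    relations = []
    prev_tokens: List[str] = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        conn_idx = find_connective(tokens)
        if conn_idx is not None:
            position = locate_arg1(conn_idx)
            arg1, arg2 = extract_arguments(prev_tokens, tokens, conn_idx, position)
            relations.append(Relation(tokens[conn_idx], position, arg1, arg2))
        prev_tokens = tokens
    return relations


if __name__ == "__main__":
    for rel in parse(["x is even.", "Therefore x + 1 is odd."]):
        print(rel)
```

In a data-driven setting, each stand-in function would be replaced by a trained classifier that makes the same decision from features of the annotated treebank; the surrounding control flow would stay the same.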
