The challenge of syntactic dependency parsing adaptation for the patent domain

Patents are legal documents with a complex discourse of their own, which makes it difficult for off-the-shelf syntactic parsers to process them adequately. Annotating a full training corpus for the genre is a titanic task that a single project cannot afford. In this paper, we present a methodology for adapting a dependency parser to the patent genre that requires only a minimal number of additional genre-specific annotated sentences and minor domain adaptations to the treebank. After identifying the principal problems faced by the parser, we added to the training corpus sentences that condense and maximize the information brought to the model. The resulting models parse patents with performance similar to that obtained on newspaper data.
