Identification of Boundaries in Parallel Noun Phrases: a Probabilistic Swapping Model

Parallel structure factors out constituents that are common to several expressions and thereby simplifies them. Identifying the boundaries of parallel structures can greatly reduce complexity in the sentence parsing phase. In this paper, we propose a probabilistic model that identifies both the parallel cores (the corresponding constituents) and the boundaries of parallel noun phrases conjoined by "wa/gwa" (a conjunctive particle in Korean). The model is based on the idea of swapping constituents and exploits two properties of parallel structure: symmetry (two or more identical constituents are repeated) and reversibility (the order of the constituents can be changed). The probabilities are estimated from an unlabelled corpus containing parallel structures, which is an advantage over approaches trained on labelled corpora. Moreover, the model does not depend on any particular language. We also show that the semantic features of the modifiers around a parallel noun phrase, together with patterns among words, can be used to further correct the boundaries identified by the swapping model. Experiments show that our probabilistic swapping model performs much better than a symmetry-based model and machine-learning-based approaches.
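To make the swapping idea concrete, the following is a minimal sketch and not the model proposed in this paper. It assumes a whitespace-tokenized sentence with a free-standing conjunction token (a simplification; "wa/gwa" attaches to the preceding noun in Korean), and it scores each swapped variant with an add-one-smoothed bigram model estimated from an unlabelled corpus. The function names `train_bigrams`, `log_prob`, `swap`, and `best_boundaries` are hypothetical and introduced only for illustration. The intuition is that, by reversibility, exchanging the two conjuncts at the correct boundaries should still yield a plausible word sequence.

```python
import math
from collections import Counter


def train_bigrams(corpus):
    """Count unigrams and bigrams over tokenized sentences (no labels needed)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Add-one-smoothed bigram log-probability (an assumed scoring function)."""
    toks = ["<s>"] + tokens + ["</s>"]
    vocab = len(unigrams)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(toks, toks[1:])
    )


def swap(tokens, k, i, j):
    """Exchange the left conjunct tokens[i:k] and the right conjunct
    tokens[k+1:j] around the conjunction at index k."""
    return tokens[:i] + tokens[k + 1:j] + [tokens[k]] + tokens[i:k] + tokens[j:]


def best_boundaries(tokens, conj, unigrams, bigrams):
    """Return the (i, j) whose swapped sentence scores highest.
    Assumes conj occurs with at least one token on each side."""
    k = tokens.index(conj)
    candidates = [(i, j) for i in range(k) for j in range(k + 2, len(tokens) + 1)]
    return max(
        candidates,
        key=lambda ij: log_prob(swap(tokens, k, *ij), unigrams, bigrams),
    )


tokens = "the red apples and pears fell".split()
print(swap(tokens, 3, 2, 5))  # ['the', 'red', 'pears', 'and', 'apples', 'fell']
```

Because every swapped variant contains the same tokens, the candidates have equal length and their scores are directly comparable; only the bigrams at the junctions of the swapped spans differ between candidates.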