Structured Learning for Semantic Role Labeling

The use of complex grammatical features in statistical language learning assumes the availability of large-scale training data and good quality parsers, especially for languages different from English. In this paper, we show how good quality FrameNet SRL systems can be obtained, without relying on full syntactic parsing, by backing off to surface grammatical representations and structured learning. The model is shown here to achieve state-of-the-art results on standard benchmarks, while its robustness is confirmed in poor training conditions, for a language different from English, i.e. Italian.

1 Linguistic Features for Inductive Tasks

Language learning systems usually generalize linguistic observations into statistical models of higher level semantic tasks, such as Semantic Role Labeling (SRL). Statistical learning methods assume that lexical or grammatical aspects of the training data are the basic features for modeling the different inferences; these are then generalized into the predictive patterns composing the final induced model. Lexical information captures semantic information and fine-grained, context-dependent aspects of the input data. However, it is largely affected by data sparseness, as lexical evidence is often poorly represented in training. It is also difficult to generalize and does not scale well, since the development of large-scale lexical knowledge bases is very expensive. Moreover, other crucial properties, such as word ordering, are neglected by purely lexical representations, so syntax must also be properly addressed.

In semantic role labeling, the role of grammatical features has been outlined since the seminal work by [6]. Symbolic expressions derived from parse trees denote the position of, and the relationship between, an argument and its predicate, and they are used as features. Parse tree paths are such features, employed in [11] for semantic role labeling. Tree kernels, introduced by [4], model the similarity between two training examples as a function of the shared parts of their parse trees. Applied to different tasks, from parsing [4] to semantic role labeling [16], tree kernels determine expressive representations for effective grammatical feature engineering.

However, there is no free lunch in the adoption of lexical and grammatical features in complex NLP tasks. First, lexical information is hard to generalize properly whenever the amount of training data is small. Large-scale general-purpose lexicons are available, but their employment in specific tasks is not satisfactory: coverage in domain (or corpus)-specific tasks is often poor and domain adaptation is difficult. For example, the lack of lexical information is often claimed as the main cause of significant performance drops in out-of-domain argument classification [20,11]. Corpus-driven methods have often been advocated as a solution, according to distributional views on lexical semantics. These are traditionally used to acquire meaning generalizations in an unsupervised fashion (e.g. [18,27]) through the analysis of distributions of word occurrences. In line with previous work (e.g. [9,5]), we pursue this line of research by extending a supervised approach through the adoption of vector-based models of lexical meaning, as discussed in Section 2.2.
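As an illustration of this kind of word space model, the following sketch derives low-dimensional lexical vectors from word co-occurrence counts through a truncated Singular Value Decomposition. It is a minimal sketch: the toy corpus, window size and dimensionality are illustrative assumptions, not the configuration adopted in this paper.

```python
# Minimal sketch of a distributional word space: co-occurrence counts
# reduced via truncated SVD (LSA-style). Corpus, window size and the
# number of dimensions are illustrative choices, not those of the paper.
import numpy as np

def build_word_space(sentences, window=2, dims=2):
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if i != j:
                    counts[index[w], index[s[j]]] += 1.0
    # Truncated SVD: keep the first `dims` left singular vectors,
    # scaled by the singular values, as low-dimensional lexical vectors.
    U, S, Vt = np.linalg.svd(counts, full_matrices=False)
    vectors = U[:, :dims] * S[:dims]
    return {w: vectors[index[w]] for w in vocab}

corpus = [["the", "cat", "chases", "the", "mouse"],
          ["the", "dog", "chases", "the", "cat"]]
space = build_word_space(corpus, window=2, dims=2)
print(space["cat"], space["dog"])
```

Lexical similarity can then be estimated as the cosine between such vectors, so that words poorly represented in the annotated data can still be compared with the lexical evidence observed in training.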
The adoption of grammatical features and tree kernels is also problematic. First, strict requirements exist in terms of the size of the training data set, as high dimensionality spaces are generated whose data sparseness can be prohibitive. Although specific forms of optimization have been proposed to limit their inherent complexity (e.g. [15]), tree kernels do not scale well over very large training data sets. Moreover, methods for extracting grammatical features from parse trees (e.g. [6]) are strongly biased by the parsing quality. Several studies showed that parsing inaccuracies significantly lower the quality of training data. In [19], experiments over gold parse trees are reported with an accuracy (93%) significantly higher than the one obtained with automatically derived trees (79%). In [14], sequential tagging is applied to SRL and a comparative analysis between two SRL systems is reported: they share the same architecture, but are built on partial vs. full parsing input, respectively. Finally, [11] reports that the adoption of the syntactic parser restricts the correct treatment of FrameNet roles to only 82% of them, i.e. the only ones that are grammatically recognized. This thus constitutes a strict upper bound for an SRL cascade based on full parsing material.

We explore here a possible solution to the above problems through the adoption of shallow grammatical features that avoid the use of a full parser in SRL, combined with distributional models of lexical semantics. While parsing accuracy varies widely across corpora, the adoption of shallower features (e.g. POS n-grams) increases robustness and applicability and minimizes overfitting. At the same time, lexical information is made available in terms of vector spaces derived automatically from large (domain-specific) corpora. This aims to increase the quality of the achievable generalization without strict requirements in terms of training data. The expected result is the design of flexible SRL systems, also applicable in poor training conditions, such as for languages where limited resources are available.

The open research questions are: which shallow grammatical representation is suitable to support the learning of fine-grained semantic models? Are lexical generalizations provided by distributional analysis of large corpora helpful? Can the grammatical generalizations derived from shallow syntactic representations be profitably augmented through word space models of the involved lexical information?

In the rest of this work, we will show that a structured learning framework, such as SVM [1], benefits from simpler features and provides competitive performance with respect to richer syntactic representations. Moreover, a distributional method (i.e. Singular Value Decomposition) applied to unlabeled corpora is introduced in order to acquire effective lexical generalizations. The resulting model is then tested and its performance is observed in standard as well as poor training conditions, characterizing two languages: English and Italian, respectively.
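To make the notion of shallow grammatical representation more concrete, the following sketch extracts POS n-gram features in a local window around each token, together with its distance from the predicate; such token-level observations can feed a sequence labeler that assigns argument labels in a BIO-style encoding. Feature names, the window size and the toy sentence are assumptions introduced for illustration, not the exact feature set of the proposed system.

```python
# Minimal sketch of shallow, parser-free features for sequence-based SRL:
# each token is described by the POS n-grams observed in a small window
# around it, plus its distance from the predicate. Feature names and the
# window size are illustrative; they are not the paper's exact feature set.
def shallow_features(tokens, pos_tags, predicate_idx, window=2):
    features = []
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        feats = {"pos": pos, "dist_from_pred": i - predicate_idx}
        # POS unigrams and bigrams in a [-window, +window] context
        for off in range(-window, window + 1):
            j = i + off
            if 0 <= j < len(pos_tags):
                feats["pos[%+d]" % off] = pos_tags[j]
                if j + 1 < len(pos_tags):
                    feats["pos_bigram[%+d]" % off] = pos_tags[j] + "_" + pos_tags[j + 1]
        features.append(feats)
    return features

tokens = ["John", "bought", "a", "car", "yesterday"]
pos = ["NNP", "VBD", "DT", "NN", "RB"]
# One feature dictionary per token, to be paired with BIO argument labels
for f in shallow_features(tokens, pos, predicate_idx=1):
    print(f)
```

Because these features depend only on a POS tagger, they remain applicable when a reliable full parser is unavailable, as in the resource-poor settings considered later for Italian.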