Clustering the Space of Phrases Identified by an Ensemble of Supervised Shallow Parsers

We present a novel method for clustering syntactic structures in raw text. The method first generates an ensemble of classifiers by training a statistical parser on samples drawn from the training data. The outputs of these classifiers for each instance are then used as input for clustering. The resulting clusters group together instances whose internal representations in the statistical model are similar; the clusters therefore depend on the feature set the model uses. We apply our method to the simple task of clustering noun phrases. Experiments show that the most encouraging results are obtained when the training samples are small. Possible applications of the method include speeding up error analysis of a supervised system and adapting a supervised system to a new domain.
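
The pipeline described above (train several classifiers on small samples, collect their outputs per instance, cluster on those outputs) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy part-of-speech features, the lookup-based `train_member` learner standing in for a statistical shallow parser, and the choice to cluster by grouping instances with identical output vectors rather than with a distance-based clustering algorithm.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy data: (first POS tag, last POS tag) -> 1 if noun phrase, else 0.
TOY_TRAIN = [
    (("DT", "NN"), 1),
    (("JJ", "NNS"), 1),
    (("DT", "NNS"), 1),
    (("VB", "RB"), 0),
    (("IN", "DT"), 0),
    (("VB", "IN"), 0),
]


def train_member(sample):
    """Train one ensemble member: a per-feature-value label-count lookup.

    This is a stand-in for a statistical parser, not the paper's model.
    """
    counts = defaultdict(Counter)
    for features, label in sample:
        for position, value in enumerate(features):
            counts[(position, value)][label] += 1

    def predict(features):
        votes = Counter()
        for position, value in enumerate(features):
            votes.update(counts.get((position, value), {}))
        # Default to label 0 when no feature value was seen in the sample.
        return votes.most_common(1)[0][0] if votes else 0

    return predict


def build_ensemble(train, n_members, sample_size, seed=0):
    """Train each member on its own small random sample of the training data."""
    rng = random.Random(seed)
    return [train_member(rng.sample(train, sample_size)) for _ in range(n_members)]


def output_vector(ensemble, features):
    """One prediction per ensemble member for a single instance."""
    return tuple(member(features) for member in ensemble)


def cluster_by_outputs(ensemble, instances):
    """Group instances whose ensemble output vectors are identical (assumed
    simplification of the clustering step)."""
    clusters = defaultdict(list)
    for features in instances:
        clusters[output_vector(ensemble, features)].append(features)
    return dict(clusters)


ensemble = build_ensemble(TOY_TRAIN, n_members=5, sample_size=3, seed=1)
clusters = cluster_by_outputs(ensemble, [features for features, _ in TOY_TRAIN])
for vector, members in clusters.items():
    print(vector, members)
```

Keeping `sample_size` small makes the members disagree more often, so the output vectors carry more information to cluster on; this mirrors, in a very loose way, the observation that small training samples gave the most encouraging results.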