An Efficient, Generic Approach to Extracting Multi-Word Expressions from Dependency Trees

The Varro toolkit offers an intuitive mechanism for extracting syntactically motivated multi-word expressions (MWEs) from dependency treebanks by searching for recurring connected subtrees rather than recurring subsequences of strings. This approach can find MWEs whose components appear in varying orders or are separated by intervening words. This paper also proposes description length gain as a statistical correlation measure well suited to tree structures.
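
To illustrate why connected subtrees can capture what string subsequences miss, the following minimal sketch counts head-dependent pairs (the smallest connected subtrees) alongside contiguous word bigrams in two hand-built toy parses. It is not Varro's algorithm or interface: the tree encoding, the example parses, and the names `edge_counts`, `bigram_counts`, and `recurring` are all assumptions made for this illustration, and no statistical measure such as description length gain is applied.

```python
from collections import Counter

# Each toy tree is a list of (word, head_index) pairs; head_index -1 marks the root.
# These hand-built parses are illustrative assumptions, not treebank data.
TREES = [
    # "He took the decision into account"
    [("he", 1), ("took", -1), ("the", 3), ("decision", 1),
     ("into", 1), ("account", 4)],
    # "Into account she took every careful measurement"
    [("into", 3), ("account", 0), ("she", 3), ("took", -1),
     ("every", 6), ("careful", 6), ("measurement", 3)],
]


def edge_counts(trees):
    """Count (head word, dependent word) pairs across all trees."""
    counts = Counter()
    for tree in trees:
        for word, head in tree:
            if head >= 0:  # skip the root, which has no head
                counts[(tree[head][0], word)] += 1
    return counts


def bigram_counts(trees):
    """Count contiguous word pairs in surface order, for comparison."""
    counts = Counter()
    for tree in trees:
        words = [word for word, _ in tree]
        counts.update(zip(words, words[1:]))
    return counts


def recurring(counts, min_count=2):
    """Return the pairs seen at least min_count times, in sorted order."""
    return sorted(pair for pair, n in counts.items() if n >= min_count)


RECURRING_EDGES = recurring(edge_counts(TREES))
RECURRING_BIGRAMS = recurring(bigram_counts(TREES))

print("recurring edges:  ", RECURRING_EDGES)
print("recurring bigrams:", RECURRING_BIGRAMS)
```

In these toy parses the link from "took" to "into" recurs as a tree edge even though the two words are never adjacent and appear in opposite orders, whereas the bigram count recovers only the contiguous pair "into account".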