Automatic Syntactic Segment Filtration for Mass Syntax Corpus with Mutual Information

Syntactic analysis (syntactic parsing) is an important method in natural language processing. Syntactic parsing aims to find the linguistic structure of a sentence given the knowledge of a certain grammar. The constituent parser, which builds a hierarchical structure from phrase segments, is the most popular method in current NLP applications. Many approaches have been proposed to improve parsing algorithms so as to raise the precision and recall of the syntactic segments they find. In this paper, we propose a novel method that greatly improves the precision of the syntactic segments without modifying the parsing algorithms. The method is introduced as a post-processing step that filters the syntactic segments according to their mutual information with the context. The new method can obtain a high-confidence subset from a mass syntax corpus and is independent of the parsing algorithm. The effectiveness of the approach is validated by experimental results.