Discovering Linguistic Patterns in a Parsed Corpus with Frequent Subtree Mining

Recognizing language-specific linguistic patterns is helpful for many NLP applications, such as information extraction, machine translation, and parsing. State-of-the-art syntactic parsers are based on a given grammar. This grammar is context-free and cannot capture complex patterns that span multiple linguistic units. We propose an unsupervised method that automatically discovers such complex linguistic patterns from a conventionally parsed corpus. A specialized, efficient algorithm mines the frequent subtrees in the forest of parse trees, and the subtrees it finds are formalized as linguistic patterns. We validate the approach on the Penn Chinese Treebank and report the linguistic patterns found.
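To make the idea of mining frequent subtrees in a forest concrete, the following is a minimal sketch and not the specialized algorithm the abstract refers to. It counts only complete (bottom-up) subtrees, a much simpler notion than general frequent subtree mining, and measures support as the number of trees containing a pattern. The nested-tuple tree encoding, the function names, and the toy forest are all assumptions made for illustration; the labels merely imitate Penn Chinese Treebank tags.

```python
from collections import Counter

# Assumed encoding: a tree is a nested tuple (label, child, child, ...);
# a leaf is a one-element tuple such as ("NN",).

def subtrees(tree):
    """Yield the complete (bottom-up) subtree rooted at every node."""
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)

def frequent_subtrees(forest, min_support):
    """Return the subtrees that occur in at least min_support distinct trees."""
    support = Counter()
    for tree in forest:
        # set() ensures each tree contributes at most once per pattern.
        support.update(set(subtrees(tree)))
    return {p: n for p, n in support.items() if n >= min_support}

# Hypothetical toy forest of three parse trees.
forest = [
    ("IP", ("NP", ("NN",)), ("VP", ("VV",), ("NP", ("NN",)))),
    ("IP", ("NP", ("PN",)), ("VP", ("VV",), ("NP", ("NN",)))),
    ("NP", ("NN",)),
]

patterns = frequent_subtrees(forest, min_support=2)
for pattern, count in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(count, pattern)
```

With a minimum support of 2, the verb phrase subtree shared by the first two trees is kept as a pattern, while the two full sentence trees, which each occur only once, are discarded.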
