Efficiently Extract Recurring Tree Fragments from Large Treebanks

In this paper we describe FragmentSeeker, a tool that identifies all tree constructions that recur multiple times in a large phrase-structure treebank. The tool is based on an efficient kernel-based dynamic algorithm, which compares every pair of trees in a given treebank and computes the list of fragments they share. We describe the two notions of fragment we use, standard and partial fragments, and provide implementation details on how to extract them from a syntactically annotated corpus. We have tested our system on the Penn Wall Street Journal treebank, for which we present a quantitative and qualitative analysis of the recurring structures obtained, together with empirical running times. Finally, we propose ways in which our tool could contribute to research fields related to corpus analysis and processing, such as parsing, corpus statistics, annotation guidance, and automatic detection of argument structure.
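The pairwise comparison described above can be illustrated with a minimal sketch. It is not the FragmentSeeker implementation: it uses a naive quadratic scan over node pairs rather than the kernel-based algorithm, and it assumes that a fragment is a connected subtree in which every node keeps either all of its children or none of them (the abstract does not define standard or partial fragments, so this reading is an assumption). Trees are encoded as nested tuples, and all function names are hypothetical.

```python
def production(node):
    """Node label plus its children's labels; words stay str, nonterminals become 1-tuples."""
    label, *children = node
    return (label, tuple(c if isinstance(c, str) else (c[0],) for c in children))

def nodes(tree, path=()):
    """Yield (path, node) for every internal node, parents before children."""
    yield path, tree
    for i, child in enumerate(tree[1:]):
        if not isinstance(child, str):
            yield from nodes(child, path + (i,))

def shared_fragment(n1, p1, n2, p2, covered):
    """Largest fragment rooted at both n1 and n2, which must have the same production."""
    covered.add((p1, p2))  # this node pair is now part of a larger fragment
    out = [n1[0]]
    for i, (c1, c2) in enumerate(zip(n1[1:], n2[1:])):
        if isinstance(c1, str):
            out.append(c1)  # identical word, guaranteed by the equal productions
        elif production(c1) == production(c2):
            out.append(shared_fragment(c1, p1 + (i,), c2, p2 + (i,), covered))
        else:
            out.append((c1[0],))  # trees diverge here: keep only the label as a frontier node
    return tuple(out)

def common_fragments(t1, t2):
    """Set of maximal fragments shared by two trees."""
    covered, found = set(), set()
    for p1, n1 in nodes(t1):
        for p2, n2 in nodes(t2):
            if (p1, p2) not in covered and production(n1) == production(n2):
                found.add(shared_fragment(n1, p1, n2, p2, covered))
    return found

t1 = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBD", "barked")))
t2 = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBD", "slept")))
fragments = common_fragments(t1, t2)
```

For these two toy trees the only maximal shared fragment is the S-rooted skeleton that keeps the determiner "the" and leaves NN and VBD as open frontier nodes. Running `common_fragments` over every pair of trees in a treebank would give the list of recurring fragments, at a cost that the dynamic algorithm mentioned in the abstract is designed to reduce.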
