This paper describes a corpus of syntactic structures and associated sentences. However, it is not a traditional treebank: the syntactic structures are created first and are then associated with sentences in a human language. We therefore call it a reverse treebank (RTB). The RTB was created for eliciting sentences in low-resource languages. First, a corpus of feature structures is created using a tool suite built by the authors. Second, sentences are added in a widely spoken language such as English or Spanish to express the meaning of each feature structure; we call this language the Elicitation Language. Third, a bilingual informant translates the sentences into a low-resource language. Using an elicitation tool, the informant can also graphically align the words of the Elicitation Language sentence with the words of the low-resource language sentence. The result is a high-quality, parallel, word-aligned corpus annotated with feature structures, which we call a parallel RTB. RTB sentences may have multiple clauses, but they are generally short compared with naturally occurring sentences in treebanks, because parallel RTBs are meant to provide small but highly structured corpora for machine learning when few resources are available. Corpora such as these have been used for automatic learning of transfer rules for machine translation [8].
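To make the three-step construction concrete, the following minimal Python sketch shows one way a single parallel RTB entry might be represented. It is not the authors' tool suite or file format: the class name `RTBEntry`, its field names, the feature names in the example, and the placeholder target-language tokens are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class RTBEntry:
    """One hypothetical parallel RTB entry (all names are illustrative assumptions)."""
    feature_structure: dict            # step 1: the feature structure is created first
    elicitation_sentence: str = ""     # step 2: sentence in the Elicitation Language
    translation: str = ""              # step 3: the informant's translation
    # Word alignment as (elicitation word index, translation word index) pairs.
    alignment: list = field(default_factory=list)

    def aligned_word_pairs(self):
        """Return the aligned (elicitation word, translation word) pairs."""
        src = self.elicitation_sentence.split()
        tgt = self.translation.split()
        return [(src[i], tgt[j]) for i, j in self.alignment]


# Step 1: start from a feature structure (feature names are hypothetical).
entry = RTBEntry(
    feature_structure={
        "predicate": "sleep",
        "tense": "past",
        "subject": {"person": 3, "number": "singular"},
    }
)

# Step 2: add a sentence in the Elicitation Language that expresses it.
entry.elicitation_sentence = "She slept"

# Step 3: the informant translates and aligns the words.
# The target tokens are placeholders, not words of any real language.
entry.translation = "tgt_word_1 tgt_word_2"
entry.alignment = [(0, 0), (1, 1)]

pairs = entry.aligned_word_pairs()
```

Keeping the feature structure, both sentences, and the alignment in one record reflects the description above, in which every sentence pair in a parallel RTB is annotated with the feature structure it was elicited from.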
[1] B. Comrie, et al. Lingua descriptive studies: Questionnaire, 1977.
[2] Eva Hajičová. Dependency-based underlying-structure tagging of a very large Czech corpus, 2000.
[3] Donna Gates, et al. The MILE Corpus for Less Commonly Taught Languages, 2006, HLT-NAACL.
[4] Daniel Gildea, et al. The Proposition Bank: An Annotated Corpus of Semantic Roles, 2005, CL.
[5] David R. Dowty. Thematic proto-roles and argument selection, 1991.
[6] Robert D. Van Valin, et al. Functional Syntax and Universal Grammar, 1984.
[7] Lori Levin, et al. Automatic Learning of Grammatical Encoding, 2006.