The Treegram Index: An Efficient Technique for Retrieval in Linguistic Treebanks

Multiway trees (MT, henceforth) are a common and well-understood data structure for describing hierarchical linguistic information. With the availability of large treebanks, retrieval techniques for highly structured data have become essential. In this contribution, we investigate the efficient retrieval of MT structures at the cost of a complex index, the Treegram Index. We illustrate our approach with the VENONA retrieval system, which handles the BHt (Biblia Hebraica transcripta) treebank comprising 508,650 phrase structure trees with maximum degree eight and maximum height 17, containing altogether 3.3 million Old-Hebrew words.