Fast structural query with application to Chinese Treebank sentence retrieval

In natural language processing, large amounts of structured data are routinely used to extract and present the grammatical structures of sentences. For example, the Chinese Treebank corpus developed at the Institute of Information Science, Academia Sinica, Taiwan, is a semantically annotated corpus that has been used to help parse and study Chinese sentences. In this setting, users usually query the corpus with structured tree patterns rather than keywords. In this paper, we present an online prototype system that supports exploratory search. The system implements two flexible and efficient structural query methods and provides a user-friendly web-based interface. Although the system adopts the XML format to present the corpora and search results, it does not use conventional XML query languages. Because searching the Chinese Treebank corpora is structural in nature and often involves structural similarity, conventional XML query languages such as XPath and XQuery are inflexible and inefficient for this task. We propose and implement a query algorithm called Parent-Child Relationship Filter (PCRF), which provides flexible and efficient structural search. PCRF is flexible enough to offer several similarity-matching options, such as wildcards, unordered sibling sub-trees, ancestor-descendant matching, and their combinations. In addition, PCRF supports stream-based matching to help users query their XML documents online. We also present three accelerating rules that achieve a 1.5- to 8-fold improvement in query time. Our experimental results show that our method achieves a 10- to 1000-fold performance improvement over the usual text-based XPath query method.
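The abstract does not describe how PCRF works internally, so the following is only a rough sketch of the general idea its name suggests: using parent-child label pairs as a cheap filter before any full tree matching. The `Node` class, the function names, and the `*` wildcard convention are all assumptions made for illustration, not the authors' implementation. Because the filter looks only at edges, it ignores sibling order, and it is a necessary condition rather than a complete match test.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    """Hypothetical tree node: a grammatical label plus child nodes."""
    label: str
    children: List["Node"] = field(default_factory=list)


def parent_child_pairs(root: Node) -> Set[Tuple[str, str]]:
    """Collect every (parent label, child label) edge in the tree."""
    pairs = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            pairs.add((node.label, child.label))
            stack.append(child)
    return pairs


def passes_filter(pattern: Node, sentence: Node) -> bool:
    """Return True if every edge of the pattern occurs in the sentence tree.

    '*' in the pattern matches any label (an assumed wildcard convention).
    Passing this filter does not guarantee a full structural match; it only
    rules out sentences that cannot match.
    """
    sentence_pairs = parent_child_pairs(sentence)
    for p, c in parent_child_pairs(pattern):
        found = any(
            (p == "*" or p == sp) and (c == "*" or c == sc)
            for sp, sc in sentence_pairs
        )
        if not found:
            return False
    return True


# Toy sentence tree: S(NP(N), VP(V, NP(N)))
sentence = Node("S", [
    Node("NP", [Node("N")]),
    Node("VP", [Node("V"), Node("NP", [Node("N")])]),
])

# Sibling order in the pattern differs from the sentence, but edges still match.
unordered_pattern = Node("VP", [Node("NP"), Node("V")])
wildcard_pattern = Node("S", [Node("*")])
missing_pattern = Node("VP", [Node("PP")])
```

In this sketch, a sentence that fails the filter can be discarded without further work, while one that passes would still need a complete tree comparison to confirm the match.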
