This work extends a previously proposed semi-automatic grammar induction approach (see Meng, H. and Siu, K.C., IEEE Trans. on Knowledge and Data Engineering). The data-driven approach learns semantic and phrasal categories from a training corpus of unannotated natural language queries in a specific domain. It can be seeded with prespecified semantic categories to expedite learning. Grammar rules are acquired automatically by an agglomerative clustering procedure, and the resulting grammar can easily be hand-edited for refinement. This work attempts to improve the grammar induction framework by leveraging the SQL query that accompanies every training query. The SQL expression specifies the database access action that corresponds to the query, and hence provides information about the meaningful natural language structures that should be captured in the induced grammar. We have also incorporated information gain in place of mutual information to capture phrasal structures, and we determine an automatic stopping criterion for agglomerative clustering.
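To make the idea of scoring candidate phrases with information gain concrete, the following is a minimal sketch and not the formulation used in this work, which the abstract does not spell out. It assumes one common definition: the reduction in entropy of the binary event "the right-hand word is `second`" once it is known whether the left-hand word is `first`. The toy corpus and the function names `bigram_counts`, `entropy`, and `information_gain` are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs in whitespace-tokenized sentences."""
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(first, second, bigrams):
    """Assumed definition: H(Y) - H(Y | X), where over all adjacent pairs
    X = "left word is `first`" and Y = "right word is `second`"."""
    total = sum(bigrams.values())
    n_first = sum(c for (left, _), c in bigrams.items() if left == first)
    n_second = sum(c for (_, right), c in bigrams.items() if right == second)
    n_both = bigrams[(first, second)]

    p_y = n_second / total
    h_y = entropy([p_y, 1 - p_y])

    # Conditional entropy: weight the entropy of Y within each value of X.
    h_y_given_x = 0.0
    for n_x, n_xy in ((n_first, n_both), (total - n_first, n_second - n_both)):
        if n_x == 0:
            continue
        p = n_xy / n_x
        h_y_given_x += (n_x / total) * entropy([p, 1 - p])
    return h_y - h_y_given_x


# Hypothetical toy queries, loosely in the style of an air travel domain.
corpus = [
    "flights from boston to new york",
    "flights from denver to new york",
    "show flights to boston",
    "fares from new york to denver",
]
counts = bigram_counts(corpus)
ig_new_york = information_gain("new", "york", counts)
ig_to_new = information_gain("to", "new", counts)
print(f"IG(new, york) = {ig_new_york:.3f}")
print(f"IG(to, new)   = {ig_to_new:.3f}")
```

In this toy corpus "new" is always followed by "york", so the pair scores higher than "to new", where "to" is followed by several different words. A clustering procedure could use such a score to decide which adjacent pair to merge into a phrase next; how the score enters the actual procedure, and how the stopping criterion is set, is not specified in the abstract.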