Improvements on a semi-automatic grammar induction framework

This work extends the semi-automatic grammar induction approach previously proposed (see Meng, H. and Siu, K.C., IEEE Trans. on Knowledge and Data Engineering). The data-driven approach learns semantic and phrasal categories from a training corpus of unannotated natural language queries in a specific domain. The approach can be seeded with prespecified semantic categories to expedite the learning process. Grammar rules are acquired automatically by an agglomerative clustering procedure, and the resulting grammar can easily be hand-edited for refinement. This work attempts to improve the grammar induction framework by leveraging information in the SQL query that accompanies every training query. The SQL expression specifies the database access action that corresponds to the query, and hence provides information about meaningful natural language structures that should be captured in the induced grammar. We have also incorporated the use of information gain in place of mutual information to capture phrasal structures, as well as an automatically determined stopping criterion for agglomerative clustering.
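To make the contrast between the two phrasal measures concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "mutual information" is scored pointwise for an adjacent word pair, and that "information gain" is the reduction in entropy of the event "the next word is w2" once we know whether the preceding word is w1. The function names, the toy queries, and both definitions are illustrative assumptions; the abstract does not specify the exact formulas used.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def bigram_scores(sentences, w1, w2):
    """Return (pmi, info_gain) for the adjacent word pair (w1, w2).

    Assumed definitions (not taken from the source):
      pmi       = log2( P(w1, w2) / (P(w1) * P(w2)) )
      info_gain = H(Y) - H(Y | X), where X = [previous word is w1]
                  and Y = [next word is w2] are binary indicators.
    """
    pairs = [(a, b) for s in sentences for a, b in zip(s, s[1:])]
    n = len(pairs)
    n_x = sum(1 for a, _ in pairs if a == w1)
    n_y = sum(1 for _, b in pairs if b == w2)
    n_xy = sum(1 for a, b in pairs if a == w1 and b == w2)

    # Pointwise mutual information; undefined pairs score -infinity.
    pmi = math.log2(n_xy * n / (n_x * n_y)) if n_xy > 0 else float("-inf")

    # Information gain as the entropy reduction of Y after observing X.
    p_y = n_y / n
    h_y = entropy([p_y, 1 - p_y])
    h_y_given_x = 0.0
    if n_x > 0:
        p = n_xy / n_x
        h_y_given_x += (n_x / n) * entropy([p, 1 - p])
    if n - n_x > 0:
        p = (n_y - n_xy) / (n - n_x)
        h_y_given_x += ((n - n_x) / n) * entropy([p, 1 - p])
    return pmi, h_y - h_y_given_x


# Hypothetical toy queries, invented for illustration only.
queries = [
    "show me flights from boston to denver".split(),
    "list flights from denver to boston".split(),
    "show me fares from boston to atlanta".split(),
]

pmi, gain = bigram_scores(queries, "show", "me")
print(f"PMI = {pmi:.3f} bits, information gain = {gain:.3f} bits")
```

Under these assumed definitions, the pointwise score considers only the occurrences of the pair itself, whereas the information gain also accounts for the contexts in which either word is absent. A pair that scores highly on either measure would be a candidate for merging into a phrasal unit during clustering.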