Statistical-based Approach for Indonesian Complex Factoid Question Decomposition

This research proposes a method for decomposing a complex factoid question into several independent questions. The method comprises four stages: (1) classifying the input question into one of several categories, such as sub-question, coordination, exemplification, or double question; (2) generating all possible question boundary candidates; (3) selecting the best question boundary; and (4) applying the question decomposition rule at the best question boundary. We compared several machine learning algorithms for the first stage (complex factoid question classification) and the third stage (question decomposition boundary selection). The classification features are specific word lists with their related information, including the syntactic feature of the POS (Part of Speech) tag. For the experiments, we annotated 916 sentences as training data and 226 sentences as testing data. The perplexity of the annotated corpus was 1.000586, with 307 Out of Vocabulary (OOV) words. The complex factoid question classification accuracy reached 93.8% with the Random Forest (RF) algorithm. The question decomposition boundary selection accuracy reached 93.80% for sub-question (using RF), 86.11% for double question (using RF), 88.23% for coordination (using SMO), and 60.87% for exemplification (using kNN, NB, and RF). A revision rule applied to the question decomposition boundary selection improved the accuracy to 97.22% for double question, 94.11% for coordination, and 65.21% for exemplification.
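
The four stages can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this study: the word lists, the `simple` label, the keyword rule in `classify`, and the hand-written `score_boundary` are hypothetical stand-ins for the trained classifiers and the actual feature set, and every function name is an assumption made for illustration. The sketch also does not resolve pronouns, so the resulting questions are not yet fully independent.

```python
from typing import List, Tuple

# Hypothetical word lists, for illustration only; not the paper's features.
QUESTION_WORDS = {"siapa", "apa", "kapan", "berapa", "mengapa"}
CONJUNCTIONS = {"dan", "serta"}


def classify(tokens: List[str]) -> str:
    """Stage 1 stand-in: a keyword rule in place of a trained classifier."""
    n_question_words = sum(tok in QUESTION_WORDS for tok in tokens)
    return "double_question" if n_question_words >= 2 else "simple"


def generate_candidates(tokens: List[str]) -> List[int]:
    """Stage 2: every gap between tokens is a candidate boundary.

    Candidate i splits the question into tokens[:i] and tokens[i:].
    """
    return list(range(1, len(tokens)))


def score_boundary(tokens: List[str], i: int) -> float:
    """Stage 3 stand-in: a hand-written score in place of a trained model."""
    starts_with_conjunction = tokens[i] in CONJUNCTIONS
    followed_by_question_word = (
        i + 1 < len(tokens) and tokens[i + 1] in QUESTION_WORDS
    )
    return 1.0 if starts_with_conjunction and followed_by_question_word else 0.0


def apply_rule(tokens: List[str], boundary: int) -> List[str]:
    """Stage 4: split at the boundary and drop a leading conjunction."""
    left, right = tokens[:boundary], tokens[boundary:]
    if right and right[0] in CONJUNCTIONS:
        right = right[1:]
    return [" ".join(left), " ".join(right)]


def decompose(question: str) -> Tuple[str, List[str]]:
    """Run the four stages in order and return (category, questions)."""
    tokens = question.lower().rstrip("?").split()
    category = classify(tokens)
    candidates = generate_candidates(tokens)
    if category == "simple" or not candidates:
        # Nothing to decompose: return the question unchanged.
        return category, [question]
    best = max(candidates, key=lambda i: score_boundary(tokens, i))
    return category, apply_rule(tokens, best)


print(decompose("Siapa presiden pertama Indonesia dan kapan ia lahir?"))
```

For the toy input above ("Who was the first president of Indonesia and when was he born?"), the sketch selects the boundary in front of "dan" and returns the two parts "siapa presiden pertama indonesia" and "kapan ia lahir". Keeping candidate generation separate from boundary selection mirrors the staged design described above, so the scoring function could be swapped for a trained model without touching the other stages.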