Question Pre-Processing in a QA System on Internet Discussion Groups

This paper proposes methods for pre-processing the questions in postings before a QA system searches for answers in an Internet discussion group. Pre-processing consists of garbage text removal and question segmentation. For garbage text identification, garbage keywords are collected and assigned different length thresholds. Questions are segmented using interrogative forms and question types. On the test set, the best configurations achieve 92.57% accuracy in garbage text removal and 85.87% accuracy in question segmentation.
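
The two pre-processing steps can be illustrated with a minimal sketch. It is not the paper's implementation: the keyword list, the threshold values, the rule that a line is garbage when it contains a garbage keyword and is no longer than that keyword's threshold, and the use of a trailing question mark as the only interrogative cue are all assumptions made for illustration. The function names are hypothetical.

```python
import re

# Hypothetical garbage keywords mapped to length thresholds (in characters).
# Both the keywords and the threshold values are illustrative assumptions.
GARBAGE_THRESHOLDS = {
    "thanks": 30,
    "hello": 20,
    "please help": 40,
}


def is_garbage(line):
    """Assumed rule: a line is garbage if it contains a garbage keyword
    and is no longer than that keyword's length threshold."""
    text = line.strip().lower()
    return any(
        keyword in text and len(text) <= limit
        for keyword, limit in GARBAGE_THRESHOLDS.items()
    )


def remove_garbage(posting):
    """Return the non-empty lines of a posting that are not flagged as garbage."""
    return [
        line.strip()
        for line in posting.splitlines()
        if line.strip() and not is_garbage(line)
    ]


def segment_questions(lines):
    """Split the remaining text into sentences and keep those ending in '?'.
    A trailing question mark stands in for the interrogative forms and
    question types used in the paper (a simplifying assumption)."""
    text = " ".join(lines)
    sentences = re.split(r"(?<=[.?!])\s+", text)
    return [s.strip() for s in sentences if s.strip().endswith("?")]


if __name__ == "__main__":
    posting = (
        "Hello everyone\n"
        "My laptop will not boot after the update. How can I fix it?\n"
        "Which driver should I install?\n"
        "Thanks in advance!"
    )
    for question in segment_questions(remove_garbage(posting)):
        print(question)
```

Tying each keyword to its own threshold lets a short greeting or sign-off be dropped while a longer line that happens to contain the same keyword is kept as potential question content.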