Subtopic segmentation of Chinese documents: an adapted dotplot approach

An adapted dotplot model based on Chinese word sense quantization is presented for finding the boundaries of subtopics in a document. Data reduction techniques from rough set theory are introduced to select the axis words of the word space. For the discretized and filtered data in the information table, the mutual information between axis words and feature words is calculated. The adapted model is then constructed by replacing the counting of identical words with a calculation of the similarity between feature words. Because the model is a submodule of our InsunAbs Chinese auto-summarization system, its performance is evaluated indirectly through a quantitative evaluation. In the test experiments, the adapted model outperforms both the baseline and the original dotplot model.
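
The core substitution described above, replacing identical-word counts in the dotplot with similarities between feature words, can be illustrated with a minimal sketch. This is not the InsunAbs implementation. It assumes that each feature word is represented as a vector over the axis words (for example, of mutual information values), that similarity is measured by cosine, and that a single boundary is placed where the average similarity across the boundary is lowest. The function names, the toy vectors, and the single-boundary criterion are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_dotplot(words, vectors):
    """Build an n x n plot whose cell (i, j) holds the similarity of words i and j.

    The original dotplot would put 1 in a cell only when the two words are
    identical; here the cell holds a graded similarity instead.
    """
    n = len(words)
    return [[cosine(vectors[words[i]], vectors[words[j]]) for j in range(n)]
            for i in range(n)]

def best_boundary(plot):
    """Return (position, density) of the split with the lowest cross-segment density.

    Simplifying assumption: exactly one boundary is sought, and density is the
    mean similarity between words before the split and words after it.
    """
    n = len(plot)
    best_pos, best_density = None, float("inf")
    for b in range(1, n):
        cross = sum(plot[i][j] for i in range(b) for j in range(b, n))
        density = cross / (b * (n - b))
        if density < best_density:
            best_pos, best_density = b, density
    return best_pos, best_density

# Hypothetical toy data: two axis words, four feature words, six word positions.
vectors = {
    "stock": [1.0, 0.0],
    "market": [0.9, 0.1],
    "rain": [0.0, 1.0],
    "cloud": [0.1, 0.9],
}
words = ["stock", "market", "stock", "rain", "cloud", "rain"]

plot = similarity_dotplot(words, vectors)
boundary, density = best_boundary(plot)
print(boundary, round(density, 4))
```

On this toy input the lowest cross-segment density falls between the third and fourth words, where the vocabulary shifts from one group of related words to the other. An identical-word dotplot would treat "stock" and "market" as unrelated, which is the limitation the similarity-based variant is meant to address.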