Context-Based Text Mining for Insights in Long Documents

This paper addresses the problem of finding differences between collections of long documents. In collections such as project status reports or annual reports, both the documents and their individual sentences tend to be relatively long. As a result, it is difficult to derive insights merely by looking for concepts that are representative of a selected document collection according to a divergence metric. We propose an analysis approach based on contextual information: we extract pairs consisting of a topic word and a keyword, and assess how representative each pair is of the selected document collection. Applying the proposed method to a comparison between the annual reports of bankrupt companies and those of sound companies, we were able to derive insights that could not be extracted with conventional methods.
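
To make the idea of scoring topic word and keyword pairs concrete, the following is a minimal illustrative sketch, not the method evaluated in this paper. It assumes that a pair is formed whenever a topic word and a keyword co-occur in the same sentence, and it uses a smoothed ratio of document frequencies as a stand-in for the divergence metric, which the abstract does not specify. The function names, the period-based sentence splitting, and the word lists are all hypothetical.

```python
from collections import Counter
from itertools import product


def extract_pairs(doc, topic_words, keywords):
    """Return the set of (topic word, keyword) pairs that co-occur
    in at least one sentence of `doc`.

    Assumption: sentences are separated by periods and tokens by
    whitespace; a real system would use a proper parser.
    """
    pairs = set()
    for sentence in doc.lower().split("."):
        tokens = set(sentence.split())
        topics = tokens & topic_words
        keys = tokens & keywords
        pairs.update((t, k) for t, k in product(topics, keys) if t != k)
    return pairs


def pair_representativeness(selected_docs, reference_docs,
                            topic_words, keywords, smoothing=1.0):
    """Rank pairs by how much more often they appear in the selected
    collection than in the reference collection.

    Assumption: the score is a smoothed ratio of document frequencies,
    used here only as a simple placeholder for a divergence metric.
    """
    selected_counts = Counter()
    for doc in selected_docs:
        selected_counts.update(extract_pairs(doc, topic_words, keywords))

    reference_counts = Counter()
    for doc in reference_docs:
        reference_counts.update(extract_pairs(doc, topic_words, keywords))

    scores = {}
    for pair, count in selected_counts.items():
        p_selected = (count + smoothing) / (len(selected_docs) + 2 * smoothing)
        p_reference = (reference_counts[pair] + smoothing) / (
            len(reference_docs) + 2 * smoothing)
        scores[pair] = p_selected / p_reference

    # Highest ratio first: pairs most characteristic of the selected set.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    selected = ["Sales declined sharply. Debt increased.",
                "Sales declined again."]
    reference = ["Sales increased. Debt declined."]
    ranking = pair_representativeness(
        selected, reference,
        topic_words={"sales", "debt"},
        keywords={"declined", "increased"})
    for pair, score in ranking:
        print(pair, round(score, 2))
```

In this toy example the words "sales" and "declined" each appear in both collections, so neither word alone separates them, whereas the pair ("sales", "declined") appears only in the selected collection. This is the kind of contextual distinction that scoring pairs rather than single concepts is meant to capture.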