Paper Information - Topical Sequence Profiling

Topical Sequence Profiling

This paper introduces the problem of topical sequence profiling. Given a sequence of text collections, such as the annual proceedings of a conference, the topical sequence profile is the most diverse explicit topic embedding for that sequence that is both representative and minimal. A topic embedding represents a text collection sequence as numerical topic vectors by storing the relevance of each text collection for each topic. A topic embedding is called explicit if human-readable labels are provided for its topics. It is representative for a sequence if, for each text collection, the percentage of documents that address at least one of the topics exceeds a predefined threshold. It is minimal if no topic can be removed from it without losing representativeness. From the set of all minimal representative embeddings, the one with the highest mean topic variance is sought and termed the topical sequence profile. Topical sequence profiling can be used to highlight significant topical developments, such as rise, decline, or oscillation. The computation of a topical sequence profile consists of two steps: topic acquisition and topic selection. In the first step, the sequence's text collections are mined for representative candidate topics. As a source of semantically meaningful topic labels, we propose the use of Wikipedia article titles, while the respective articles are used to build a classifier that assigns topics to documents. In the second step, the subset of candidate topics that constitutes the topical sequence profile is determined, for which we present an efficient greedy selection strategy. We demonstrate the potential of topical sequence profiling as an effective data science technology with a case study on a sequence of conference proceedings.
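The definitions in the abstract can be made concrete with a small sketch. The code below is illustrative only and rests on several assumptions that the abstract does not confirm: each document is modeled as a set of already assigned topic labels, the relevance of a collection for a topic is taken to be the fraction of its documents carrying that label, and topic variance is the population variance of that relevance across the sequence. The function `greedy_profile` is a hypothetical backward-elimination heuristic that drops low-variance topics first; it is not the paper's selection strategy, and it does not guarantee the embedding with the highest mean topic variance. All function names and the toy data are invented for illustration.

```python
from statistics import pvariance


def relevance(collection, topic):
    """Assumed relevance: fraction of documents in the collection labeled with `topic`."""
    return sum(topic in doc for doc in collection) / len(collection)


def coverage(collection, topics):
    """Fraction of documents that address at least one of the given topics."""
    return sum(bool(doc & topics) for doc in collection) / len(collection)


def is_representative(sequence, topics, threshold):
    """Every collection's coverage must exceed the threshold."""
    return all(coverage(c, topics) > threshold for c in sequence)


def is_minimal(sequence, topics, threshold):
    """Representative, and removing any single topic breaks representativeness."""
    return is_representative(sequence, topics, threshold) and not any(
        is_representative(sequence, topics - {t}, threshold) for t in topics
    )


def topic_variance(sequence, topic):
    """Assumed diversity measure: variance of a topic's relevance over the sequence."""
    return pvariance([relevance(c, topic) for c in sequence])


def mean_topic_variance(sequence, topics):
    return sum(topic_variance(sequence, t) for t in topics) / len(topics)


def greedy_profile(sequence, candidates, threshold):
    """Hypothetical heuristic: try to drop topics in order of increasing variance.

    One pass suffices for minimality because coverage can only shrink as
    topics are removed, so a topic that could not be dropped earlier
    cannot be dropped later either.
    """
    topics = set(candidates)
    if not is_representative(sequence, topics, threshold):
        raise ValueError("candidate topics are not representative")
    # Ties in variance are broken by label so the result is deterministic.
    for t in sorted(topics, key=lambda t: (topic_variance(sequence, t), t)):
        if is_representative(sequence, topics - {t}, threshold):
            topics.remove(t)
    return topics


# Toy sequence of three "proceedings"; each document is a set of topic labels.
sequence = [
    [{"A", "B"}, {"A"}, {"C"}, {"B"}],
    [{"A"}, {"C"}, {"C"}, {"B", "C"}],
    [{"C"}, {"C"}, {"B"}, {"A", "C"}],
]
profile = greedy_profile(sequence, {"A", "B", "C"}, threshold=0.7)
print(sorted(profile), mean_topic_variance(sequence, profile))
```

In this toy data, topic "C" rises from a quarter to three quarters of the documents, which is the kind of development a profile is meant to surface. Starting from the full candidate set and removing topics, rather than adding them, keeps every intermediate set representative, so the result is always a valid minimal embedding even when it is not the optimal one.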