Topical Sequence Profiling

This paper introduces the problem of topical sequence profiling. Given a sequence of text collections, such as the annual proceedings of a conference, the topical sequence profile is the most diverse explicit topic embedding for that sequence that is both representative and minimal. A topic embedding represents a text collection sequence as numerical topic vectors that store the relevance of each text collection for each topic. A topic embedding is called explicit if human-readable labels are provided for its topics. It is representative for a sequence if, for each text collection, the percentage of documents that address at least one of the topics exceeds a predefined threshold. It is minimal if no topic can be removed from it without losing representativeness. From the set of all minimal representative embeddings, the one with the highest mean topic variance is sought and termed the topical sequence profile. Topical sequence profiling can be used to highlight significant topical developments, such as rise, decline, or oscillation. The computation of a topical sequence profile consists of two steps: topic acquisition and topic selection. In the first step, the sequence's text collections are mined for representative candidate topics. As a source of semantically meaningful topic labels, we propose the use of Wikipedia article titles, while the respective articles are used to build a classifier that assigns topics to documents. In the second step, the subset of candidate topics that constitutes the topical sequence profile is determined, for which we present an efficient greedy selection strategy. We demonstrate the potential of topical sequence profiling as an effective data science technology with a case study on a sequence of conference proceedings.
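The definitions of representativeness, minimality, and mean topic variance can be illustrated with a minimal Python sketch. It is not the selection strategy presented in the paper, which the abstract does not specify. The sketch assumes that each document is given as a set of already assigned topic labels, that the relevance of a collection for a topic is the fraction of its documents carrying that topic, and that variance is the population variance across the sequence. The elimination order (lowest-variance topics first) is a simple heuristic chosen for illustration; all function names and the toy data are hypothetical.

```python
from statistics import pvariance


def coverage(collection, topics):
    """Fraction of documents that address at least one of the topics."""
    return sum(1 for doc in collection if doc & topics) / len(collection)


def is_representative(sequence, topics, threshold):
    """True if the coverage exceeds the threshold in every collection."""
    return all(coverage(c, topics) > threshold for c in sequence)


def relevance(sequence, topic):
    """Assumed relevance: share of documents per collection with the topic."""
    return [sum(1 for doc in c if topic in doc) / len(c) for c in sequence]


def mean_topic_variance(sequence, topics):
    """Mean over topics of the variance of their relevance over the sequence."""
    return sum(pvariance(relevance(sequence, t)) for t in topics) / len(topics)


def greedy_profile(sequence, candidates, threshold):
    """Heuristic sketch: drop low-variance topics while staying representative.

    One pass suffices for minimality, because removing topics can only
    lower coverage: a topic that cannot be dropped now cannot be dropped later.
    """
    topics = set(candidates)
    if not is_representative(sequence, topics, threshold):
        raise ValueError("candidate topics are not representative")
    by_variance = sorted(
        topics, key=lambda t: (pvariance(relevance(sequence, t)), t)
    )
    for topic in by_variance:
        if is_representative(sequence, topics - {topic}, threshold):
            topics.remove(topic)
    return topics


# Toy sequence of three collections with four documents each (made-up data).
sequence = [
    [{"declining", "stable"}, {"declining"}, {"declining"}, {"stable"}],
    [{"declining", "stable"}, {"rising"}, {"stable"}, {"rising", "declining"}],
    [{"rising"}, {"rising"}, {"rising", "stable"}, {"stable"}],
]
profile = greedy_profile(sequence, {"stable", "rising", "declining"}, 0.7)
print(sorted(profile))  # ['declining', 'rising']
print(mean_topic_variance(sequence, profile))
```

In this toy example the topic with constant relevance is removed first, since it contributes no variance and the remaining two topics still cover 75% of the documents in every collection. A greedy pass of this kind yields a minimal representative embedding, but it does not guarantee the highest mean topic variance among all such embeddings.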
