Hierarchical topic detection in large digital news archives: Exploring a sample based approach

Hierarchical topic detection (HTD) is a new task in the TDT 2004 evaluation program. It aims to organize a collection of unstructured news data in a directed acyclic graph (DAG) structure that reflects the topics discussed in the collection, ranging from rather coarse, category-like nodes to fine, singular events. The HTD task poses interesting challenges because its evaluation metric is composed of two parts: a travel cost component, reflecting the time needed to find the node of interest starting from the top node, and a quality cost component, determined by the quality of the selected node. We present a scalable architecture for HTD and compare several alternative choices for agglomerative clustering and DAG optimization in order to minimize the HTD cost metric. The alternatives are evaluated on the TDT3 and TDT5 test collections.
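To make the trade-off between the two cost components concrete, the following is a minimal sketch rather than the official TDT 2004 metric. It assumes a hypothetical travel cost that charges a fixed `branch_cost` for every child inspected at each node on the way down, and a hypothetical per-node `quality_cost` in [0, 1]; the weight `w_quality`, the node names, and all numbers are illustrative assumptions only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    quality_cost: float  # hypothetical: 0.0 means the node matches the topic perfectly
    children: List["Node"] = field(default_factory=list)


def best_cost(root: Node, w_quality: float = 0.5,
              branch_cost: float = 0.1) -> Tuple[float, List[str]]:
    """Return the lowest combined cost and the path from the root that achieves it.

    Combined cost (an assumption, not the official formula):
        w_quality * quality_cost + (1 - w_quality) * travel_cost
    """
    best: Tuple[float, List[str]] = (float("inf"), [])
    # Each stack entry: (node, travel cost accumulated so far, path of names).
    # All root-to-node paths are enumerated, which is fine for a small DAG.
    stack = [(root, 0.0, [root.name])]
    while stack:
        node, travel, path = stack.pop()
        cost = w_quality * node.quality_cost + (1 - w_quality) * travel
        if cost < best[0]:
            best = (cost, path)
        # Going one level deeper means inspecting every child of this node.
        step = branch_cost * len(node.children)
        for child in node.children:
            stack.append((child, travel + step, path + [child.name]))
    return best


# Toy hierarchy: a coarse root, two category-like nodes, one event node.
event = Node("election_event", quality_cost=0.1)
politics = Node("politics", quality_cost=0.6, children=[event])
sports = Node("sports", quality_cost=0.8)
root = Node("root", quality_cost=1.0, children=[politics, sports])

cost, path = best_cost(root)
print(path, round(cost, 3))
```

In this toy hierarchy the deeper event node wins because its better quality outweighs the extra travel. With a larger `branch_cost` or wider branching, a shallower and coarser node can become the cheaper choice, which is the tension that a DAG optimization step has to balance.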