Paper Information - Extracting and Rendering Representative Sequences

Extracting and Rendering Representative Sequences

This paper is concerned with the summarization of a set of categorical sequences. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of sequences in their neighbourhood. The proposed heuristic for extracting the representative subset requires as main arguments a pairwise distance matrix, a representativeness criterion, and a distance threshold under which two sequences are considered redundant or, equivalently, in the neighbourhood of each other. It first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in our TraMineR R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless in no way limited to social science data and should prove useful in many other domains.
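
The two-stage heuristic described in the abstract (rank candidates by a representativeness score, then drop redundant ones until the requested coverage is reached) can be illustrated with a minimal Python sketch. This is not the TraMineR implementation. The function name `extract_representatives`, the use of neighbourhood density as the representativeness score, the `<=` comparison against the threshold, and the toy distance matrix are all assumptions made for illustration.

```python
def extract_representatives(dist, radius, coverage=0.25):
    """Greedy sketch: pick non-redundant representatives until coverage is met.

    dist     -- square pairwise distance matrix (list of lists)
    radius   -- distance threshold defining the neighbourhood (assumed inclusive)
    coverage -- target share of sequences lying in some representative's neighbourhood
    """
    n = len(dist)

    # Stage 1: score every sequence. Assumed criterion: neighbourhood density,
    # i.e. the number of sequences within `radius` (the sequence itself included).
    density = [sum(1 for j in range(n) if dist[i][j] <= radius) for i in range(n)]

    # Candidate list sorted by decreasing score; ties broken by index.
    candidates = sorted(range(n), key=lambda i: (-density[i], i))

    # Stage 2: walk the list, skipping candidates redundant with a chosen one.
    representatives = []
    covered = set()
    for c in candidates:
        if len(covered) / n >= coverage:
            break  # requested coverage reached
        if any(dist[c][r] <= radius for r in representatives):
            continue  # c is in the neighbourhood of an existing representative
        representatives.append(c)
        covered.update(j for j in range(n) if dist[c][j] <= radius)

    return representatives, len(covered) / n


# Toy data (hypothetical): six "sequences" placed on a line, distance = gap.
positions = [0, 1, 2, 10, 11, 20]
dist = [[abs(a - b) for b in positions] for a in positions]

reps, achieved = extract_representatives(dist, radius=2, coverage=0.8)
print(reps, round(achieved, 3))
```

In this toy run the isolated sixth item is never selected, because the coverage target is already met by the two denser groups. A greedy pass of this kind does not guarantee the smallest possible representative set, which is consistent with the abstract presenting the method as a heuristic.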
