Xerox offers a bewildering array of printers and software configurations to meet the needs of production print shops. A configuration tool, used by sales analysts, elicits requirements from customers and recommends a list of product configurations. The tool generates question-and-answer case logs that provide useful historical data. Because of their unusual semi-structured question-and-answer format, these logs are not amenable to standard document clustering methods. The authors found that a hierarchical agglomerative approach using a compression-based dissimilarity measure (CDM) produced readily interpretable clusters. They compared this method empirically with two reasonable alternatives, latent semantic analysis and probabilistic latent semantic analysis, and concluded that CDM offers an accurate and easily implemented way to validate and augment the configuration tool.
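The clustering idea can be illustrated with a minimal sketch. It uses the CDM definition from Keogh et al., CDM(x, y) = C(xy) / (C(x) + C(y)), where C is the compressed size in bytes. Everything else here is an assumption made for illustration and is not taken from the paper: zlib as the compressor, average linkage as the merge rule, the function names, and the toy case logs.

```python
import zlib
from itertools import combinations


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text (zlib is an assumed compressor)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def cdm(x: str, y: str) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)).

    Values near 0.5 suggest highly related inputs; values near 1 suggest
    the inputs share little compressible structure.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def agglomerate(docs, n_clusters):
    """Average-linkage agglomerative clustering over pairwise CDM values.

    Returns a list of clusters, each a list of indices into docs.
    The linkage rule is an illustrative choice, not the paper's.
    """
    clusters = [[i] for i in range(len(docs))]
    # Precompute CDM once for every unordered pair of documents.
    dist = {
        (i, j): cdm(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }

    def linkage(a, b):
        pairs = [(min(i, j), max(i, j)) for i in a for j in b]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Pick the two clusters with the smallest average dissimilarity.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


if __name__ == "__main__":
    # Hypothetical question-and-answer case logs, invented for this sketch.
    logs = [
        "Q: monthly volume? A: 500000 pages. Q: color? A: yes. Q: finishing? A: booklet",
        "Q: monthly volume? A: 450000 pages. Q: color? A: yes. Q: finishing? A: stapling",
        "Q: variable data? A: yes. Q: workflow? A: transactional. Q: inserter? A: required",
        "Q: variable data? A: yes. Q: workflow? A: direct mail. Q: inserter? A: optional",
    ]
    for group in agglomerate(logs, n_clusters=2):
        print(group)
```

Because CDM needs only an off-the-shelf compressor, the sketch has no tunable parameters beyond the number of clusters, which is the property that makes the approach easy to implement. On very short texts the compressor's fixed overhead dominates, so values from such inputs should be read with caution.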
[1] Eamonn J. Keogh, et al. Towards parameter-free data mining. KDD, 2004.
[2] Thomas Hofmann, et al. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 2004.
[3] Tong Sun, et al. Modeling and Assessment of Production Printing Workflows Using Petri Nets. Business Process Management, 2005.
[4] Bin Ma, et al. The similarity metric. IEEE Transactions on Information Theory, 2001.
[5] Trevor I. Dix, et al. Sequence Complexity for Biological Sequence Analysis. Comput. Chem., 2000.
[6] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci., 1990.
[7] Li Wei, et al. Compression-based data mining of sequential data. Data Mining and Knowledge Discovery, 2007.