Exploiting hierarchical domain structure to compute similarity

The notion of similarity between objects finds use in many contexts, for example in search engines, collaborative filtering, and clustering. The objects being compared are often modeled as sets, and their similarity is traditionally determined from set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We propose new measures that exploit a hierarchical domain structure to produce more intuitive similarity scores. We also extend these measures to give appropriate results for multisets, which traditional measures likewise handle unsatisfactorily: for example, to correctly compute the similarity between customers who buy several instances of the same product (say, milk) or several products in the same category (say, dairy products). Finally, we provide an experimental comparison of our measures against traditional similarity measures and report on a user study that evaluated how well our measures match human intuition.
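To make the motivation concrete, the following minimal sketch contrasts a plain intersection-based score (Jaccard) with a simple hierarchy-aware score. It is an illustration only, not the measures proposed in the paper: the toy product hierarchy `PARENT`, the function names, and the choice of a lowest-common-ancestor depth ratio with best-match averaging are all assumptions made for this example.

```python
# Hypothetical toy hierarchy (child -> parent); the root has parent None.
PARENT = {
    "products": None,
    "dairy": "products",
    "bakery": "products",
    "milk": "dairy",
    "cheese": "dairy",
    "bread": "bakery",
}


def jaccard(a, b):
    """Traditional intersection-based similarity: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def ancestors(item):
    """Return the path from item up to the root, inclusive."""
    path = []
    while item is not None:
        path.append(item)
        item = PARENT[item]
    return path


def depth(item):
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(ancestors(item))


def item_similarity(x, y):
    """Assumed item-level score: 2 * depth(LCA) / (depth(x) + depth(y))."""
    ancestors_of_y = set(ancestors(y))
    # The first ancestor of x that is also an ancestor of y is the
    # lowest common ancestor (LCA).
    lca = next(node for node in ancestors(x) if node in ancestors_of_y)
    return 2 * depth(lca) / (depth(x) + depth(y))


def hierarchical_similarity(a, b):
    """Assumed set-level score: symmetric average of best item matches."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    best_for_a = sum(max(item_similarity(x, y) for y in b) for x in a)
    best_for_b = sum(max(item_similarity(x, y) for x in a) for y in b)
    return (best_for_a + best_for_b) / (len(a) + len(b))


if __name__ == "__main__":
    milk_buyer, cheese_buyer, bread_buyer = {"milk"}, {"cheese"}, {"bread"}
    print(jaccard(milk_buyer, cheese_buyer))
    print(hierarchical_similarity(milk_buyer, cheese_buyer))
    print(hierarchical_similarity(milk_buyer, bread_buyer))
```

In this toy setting, a milk buyer and a cheese buyer share no items, so Jaccard scores them 0, the same as a milk buyer and a bread buyer. The hierarchy-aware sketch scores the dairy pair 2/3 and the milk/bread pair 1/3, reflecting the shared category. Multisets, which the abstract also addresses, are not handled by this sketch.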
