Scalable Collaborative Filtering based on Latent Semantic Indexing

Nearest-neighbor collaborative filtering (CF) algorithms are gaining widespread acceptance in recommender systems and e-commerce applications. User ratings are not expected to be independent, as users follow trends of similar rating behavior. In terms of text mining, this is analogous to the formation of higher-level concepts from plain terms. In this paper, we propose a novel CF algorithm that uses Latent Semantic Indexing (LSI) to detect rating trends and makes recommendations according to them. We perform an extensive experimental evaluation with two real data sets, and the results indicate its superiority over existing CF algorithms.

Introduction

The “information overload” problem affects our everyday experience when searching for valuable knowledge. To overcome it, we often rely on suggestions from others who have more experience with a topic. On the Web, this becomes more manageable with Collaborative Filtering (CF), which provides recommendations based on the suggestions of users who have similar preferences.

Two types of CF algorithms have been proposed in the literature: memory-based algorithms, which recommend according to the preferences of nearest neighbors, and model-based algorithms, which recommend by first developing a model of user ratings. Related research has reported that memory-based algorithms (a.k.a. nearest-neighbor algorithms) achieve excellent accuracy. Their basic drawback is that they cope poorly with scalability and sparsity: they face performance problems when the volume of data is very large and the data are sparse.

Latent Semantic Indexing (LSI) has been used extensively in information retrieval to detect the latent semantic relationships between terms and documents. LSI constructs a low-rank approximation of the term-document matrix. As a result, it produces a less noisy matrix, which captures the underlying term-document relationships better than the original one.
Thus, higher-level concepts are generated from plain terms. In CF, this is analogous to the formation of users’ trends from individual preferences.

In this paper, we propose a new algorithm that is based on LSI to produce a condensed model of the user-item matrix. This model comprises a matrix that captures the main user trends and offers a twofold advantage: (i) it removes noise by focusing on the main rating trends rather than on the particularities of each individual user, and (ii) its size is much smaller than that of the original matrix, so it can speed up the search for similar users/items.

Our contribution and novelty are summarized as follows: (i) Drawing on Information Retrieval, we include the pseudo-user concept in order to compare it with our processed data. This distinguishes our method from related work (Sarwar et al. 2000b), where Singular Value Decomposition (SVD) methods have been used only to summarize the user-item matrix for dimensionality reduction. (ii) We implement a novel algorithm that tunes the number of principal components according to the data characteristics. (iii) We generalize the recommendation procedure for both user- and item-based CF methods. (iv) We generate predictions based on the users’ neighbors and not on the test user itself, as has been reported in related work so far. (v) We propose a new top-N list generation algorithm based on SVD and the Highest Prediction Rated items.

The rest of this paper is organized as follows. We summarize the related work and analyze the CF factors. We describe the proposed approach and give experimental results. Finally, we conclude the paper.

This work was conducted while the first two authors were scholars of the State Scholarships Foundation of Greece (IKY). Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
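As a minimal sketch of the idea outlined in the introduction, the following Python code (using NumPy) builds a rank-k model of a toy user-item matrix, folds a new pseudo-user into the reduced space with the standard LSI fold-in formula, and finds its nearest neighbors by cosine similarity. The toy ratings, the function names, and the rule that picks k by retaining 90% of the squared singular values are assumptions made for illustration; they are not the tuning procedure or the data of the proposed algorithm.

```python
import numpy as np


def truncated_svd(ratings, energy=0.9):
    """Return (U_k, s_k, Vt_k) for a low-rank model of `ratings`.

    Assumed rule (not taken from the paper): keep the fewest components
    whose squared singular values reach the `energy` share of the total.
    """
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return U[:, :k], s[:k], Vt[:k, :]


def fold_in(user_ratings, s_k, Vt_k):
    """Project a rating row into the k-dimensional space: r V_k S_k^{-1}.

    Assumes every retained singular value in s_k is non-zero.
    """
    return (user_ratings @ Vt_k.T) / s_k


def nearest_users(pseudo_user, U_k, n=2):
    """Return indices of the n most similar users and all cosine similarities."""
    norms = np.linalg.norm(U_k, axis=1) * np.linalg.norm(pseudo_user)
    sims = (U_k @ pseudo_user) / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims)[:n], sims


# Hypothetical 4 users x 4 items matrix (0 = unrated) with two rating trends.
R = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 4.0, 5.0],
])

U_k, s_k, Vt_k = truncated_svd(R, energy=0.9)

# A new user who likes the first two items is folded in as a pseudo-user.
new_user = np.array([5.0, 5.0, 0.0, 0.0])
pseudo_user = fold_in(new_user, s_k, Vt_k)

neighbors, sims = nearest_users(pseudo_user, U_k, n=2)
print(sorted(neighbors.tolist()))  # -> [0, 1]
```

In this toy matrix the first two users share one rating trend and the last two share another, so a pseudo-user who rates only the first two items is matched to users 0 and 1, and the neighbor search runs over k = 2 coordinates instead of the four original item columns.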

[1] John Riedl, et al. Analysis of recommendation algorithms for e-commerce, 2000, EC '00.

[2] John Riedl, et al. GroupLens: an open architecture for collaborative filtering of netnews, 1994, CSCW '94.

[3] Kenneth Y. Goldberg, et al. Eigentaste: A Constant Time Collaborative Filtering Algorithm, 2001, Information Retrieval.

[4] John Riedl, et al. An algorithmic framework for performing collaborative filtering, 1999, SIGIR '99.

[5] John Riedl, et al. Application of Dimensionality Reduction in Recommender System - A Case Study, 2000.

[6] John Riedl, et al. Item-based collaborative filtering recommendation algorithms, 2001, WWW '01.

[7] John Riedl, et al. An Empirical Analysis of Design Choices in Neighborhood-Based Collaborative Filtering Algorithms, 2002, Information Retrieval.

[8] Thomas Hofmann, et al. Latent semantic models for collaborative filtering, 2004, TOIS.

[9] John Riedl, et al. Incremental SVD-Based Algorithms for Highly Scaleable Recommender Systems, 2002.

[10] Susan T. Dumais, et al. Using Linear Algebra for Intelligent Information Retrieval, 1995, SIAM Rev.

[11] David Heckerman, et al. Empirical Analysis of Predictive Algorithms for Collaborative Filtering, 1998, UAI.

[12] Jonathan L. Herlocker, et al. A collaborative filtering algorithm and evaluation metric that accurately model the user experience, 2004, SIGIR '04.

[13] Jonathan L. Herlocker, et al. Evaluating collaborative filtering recommender systems, 2004, TOIS.

[14] G. Karypis, et al. Incremental Singular Value Decomposition Algorithms for Highly Scalable Recommender Systems, 2002.

[15] Richard A. Harshman, et al. Information retrieval using a singular value decomposition model of latent semantic structure, 1988, SIGIR '88.

[16] George Karypis, et al. Evaluation of Item-Based Top-N Recommendation Algorithms, 2001, CIKM '01.

[17] Neil J. Hurley, et al. Collaborative recommendation: A robustness analysis, 2004, TOIT.

[18] Douglas B. Terry, et al. Using collaborative filtering to weave an information tapestry, 1992, CACM.