1 Introduction
Challenge:
To deal with highly sparse data
To scale with the increasing numbers of users and items
To make satisfactory recommendations in a short time period
To deal with problems like synonymy, i.e., the same or similar items having different names
To deal with shilling attacks, data noise, and privacy-protection concerns
Methods:
Memory-based CF methods – compute similarity or weights between users or items, and make predictions or recommendations based on the calculated similarity values.
Model-based CF methods – use only rating data to estimate or learn a model to make predictions: Bayesian belief net CF, clustering CF models, latent semantic CF models, Markov decision process (MDP)-based CF.
Content-based filtering – analyzes the content of textual information and finds regularities in the content by using features of users and items.
Hybrid CF – combines CF and content-based techniques, e.g., content-boosted CF.
2 Characteristics and Challenges of CF
2.1 Data Sparsity.
The cold start problem (new user problem or new item problem) occurs when a new user or item has just entered the system. It is difficult to find similar ones because there is not enough information. New items cannot be recommended until some users rate them, and new users are unlikely to be given good recommendations because of the lack of rating or purchase history.
To alleviate the data sparsity problem, many approaches have been proposed:
Singular Value Decomposition (SVD) removes unrepresentative or insignificant users or items to reduce the dimensionality of the user-item matrix directly. Latent Semantic Indexing (LSI) uses SVD to calculate similarity between users from their representation in the reduced space (see the sketch after this list).
Principal Component Analysis (PCA), a closely related factor-analysis technique, also reduces dimensionality.
Model-based CF to address the sparsity problem
Naïve Bayes optimized by extended logistic regression provides more accurate predictions for sparse data.
Association retrieval technique applies an associative retrieval framework to explore transitive associations among users through their rating and purchase history.
Maximum margin matrix factorizations, Multiple imputation-based CF, Imputation-boosted CF
Hybrid CF to address the sparsity problem
Aspect model latent variable method for cold start combines both collaborative and content information in model fitting.
A probabilistic model to address the cold start problem classifies items into groups; predictions are made for users considering the Gaussian distribution of user ratings.
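A minimal sketch of the SVD/LSI idea above: factor a toy user-item matrix with scipy's truncated SVD and compare users in the reduced space (all data and helper names are illustrative assumptions).

import numpy as np
from scipy.sparse.linalg import svds

# Toy user-item rating matrix (0 = missing); values are illustrative only.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 0.0, 5.0, 4.0],
])

k = 2                      # keep the k strongest latent dimensions
U, s, Vt = svds(R, k=k)    # truncated SVD of the user-item matrix

user_latent = U * s        # each user represented in the reduced space

def cosine(a, b):
    # Cosine similarity between two latent user vectors, as in LSI.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"sim(user0, user1) = {cosine(user_latent[0], user_latent[1]):.3f}")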
2.2 Scalability.
The numbers of existing users and items grow tremendously, pushing computational resources beyond practical or acceptable levels.
SVD-based methods deal with the scalability problem and quickly produce good-quality recommendations, but they have to undergo expensive matrix-factorization steps.
Memory-based CF
Item-based Pearson CF addresses scalability by calculating similarities between only pairs of co-rated items instead of all pairs (a sketch follows this list).
Model-based CF
Clustering CF addresses the scalability problem by seeking users for recommendation within smaller and highly similar clusters instead of the entire database.
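A minimal sketch of item-based Pearson similarity restricted to co-rated pairs, as in the memory-based bullet above (toy data; the helper name is my own).

import numpy as np

# Toy ratings matrix, rows = users, columns = items, 0 = unrated.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 4.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

def item_pearson(R, i, j):
    # Only users who rated both items contribute to the similarity.
    co = (R[:, i] > 0) & (R[:, j] > 0)
    if co.sum() < 2:
        return 0.0                # too little overlap to compare
    x, y = R[co, i], R[co, j]
    x, y = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

print(item_pearson(R, 0, 3))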
2.3 Synonymy.
It refers to the tendency of a number of the same or very similar items to have different names or entries.
Singular Value Decomposition (SVD) constructs a semantic space where terms and documents that are closely associated are placed close to each other.
2.4 Gray Sheep.
It refers to the users whose opinions do not consistently agree or disagree with any group of people and thus do not benefit from CF.
A hybrid approach can combine content-based and CF recommendations by basing a prediction on a weighted average of the content-based prediction and the CF prediction.
2.5 Shilling Attacks.
People may give tons of positive recommendations for their own materials and negative recommendations for their competitors.
Item-based CF was found to be less affected by shilling attacks than user-based CF.
Remove global effects in the data-normalization stage of neighbor-based CF, and work with the residuals of global effects to select neighbors.
2.6 Others
1 Protecting users' privacy
2 Increased noise / sabotage
3 Explainability
2.7 The Netflix Prize Challenge.
The 2009 winning solution merged latent factor and neighborhood models.
3 Memory-Based Collaborative Filtering Techniques
To generate a top-N recommendation, we need to find the k most similar users or items (nearest neighbors) after computing the similarities, then aggregate the neighbors to get the top-N most frequent items as the recommendation.
3.1 Similarity Computation.
3.1.1 Correlation-Based Similarity
For the user-based algorithm, the Pearson correlation between users u and v is
w(u,v) = Σ_i (r_ui − r̄_u)(r_vi − r̄_v) / ( sqrt(Σ_i (r_ui − r̄_u)²) · sqrt(Σ_i (r_vi − r̄_v)²) ),
where i ranges over the set I of items co-rated by u and v, and r̄_u is u's average rating.
For the item-based algorithm, w(i,j) is computed analogously over the users who rated both items i and j.
Spearman rank correlation: the same computation applied to rating ranks instead of raw rating values.
3.1.2 Vector Cosine-Based Similarity
Vector cosine similarity: w(i,j) = (Ri · Rj) / (‖Ri‖ · ‖Rj‖), where Ri and Rj are the rating vectors of items i and j.
Adjusted cosine similarity has the same form as the Pearson correlation, except that each rating is offset by the corresponding user's average rating (rather than the item's) before the cosine is taken.
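A minimal sketch of vector cosine and adjusted cosine similarity between two item columns (toy data; helper names are my own).

import numpy as np

# Toy ratings matrix, rows = users, columns = items (fully rated for simplicity).
R = np.array([
    [5.0, 3.0, 4.0],
    [4.0, 2.0, 5.0],
    [1.0, 5.0, 2.0],
])

def vector_cosine(R, i, j):
    # Plain cosine of the angle between the two item rating vectors.
    x, y = R[:, i], R[:, j]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def adjusted_cosine(R, i, j):
    # Subtract each user's mean rating first, then take the cosine.
    C = R - R.mean(axis=1, keepdims=True)
    x, y = C[:, i], C[:, j]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

print(vector_cosine(R, 0, 2), adjusted_cosine(R, 0, 2))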
3.2 Prediction and Recommendation Computation
3.2.1 Weighted Sum of Others' Ratings.
Take a weighted average of all the ratings on the item across users, offsetting each rating by its user's mean (see the sketch after 3.2.2):
P(a,i) = r̄_a + Σ_u (r_ui − r̄_u)·w(a,u) / Σ_u |w(a,u)|, where u ranges over the users who rated item i.
3.2.2 Simple Weighted Average.
Take a weighted average of the user's ratings across items:
P(u,i) = Σ_j r_uj·w(i,j) / Σ_j |w(i,j)|, where j ranges over the items rated by u.
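A minimal sketch of the weighted-sum prediction formula above (all numbers hypothetical; similarities assumed precomputed, e.g., by Pearson correlation).

import numpy as np

def weighted_sum_prediction(active_mean, neighbor_means, neighbor_ratings, sims):
    # P(a,i) = r̄_a + sum_u (r_ui − r̄_u)·w(a,u) / sum_u |w(a,u)|
    sims = np.asarray(sims, dtype=float)
    offsets = np.asarray(neighbor_ratings, dtype=float) - np.asarray(neighbor_means, dtype=float)
    denom = np.abs(sims).sum()
    return active_mean + (sims @ offsets) / denom if denom else active_mean

# Active user's mean is 3.5; two neighbors rated the target item.
print(weighted_sum_prediction(3.5, [3.0, 4.0], [4.0, 5.0], [0.8, 0.6]))  # -> 4.5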
3.3 Top-N Recommendations.
3.3.1 User-Based Top-N Recommendation Algorithms.
Look for k most similar users.
Aggregate to identify set of items based on frequency within users.
Recommend the top-N most frequent items that the active user has not purchased yet.
3.3.2 Item-Based Top-N Recommendation Algorithms.
First, look for the k most similar items of each item the user has purchased.
Then identify the candidate set of recommended items by taking the union of those k most similar items.
Finally, remove each item in the set that the user has already purchased.
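A minimal sketch of the item-based top-N procedure above; here candidates are ranked simply by how often they appear as neighbors, whereas the full algorithm sorts them by similarity to the user's basket (toy data).

from collections import Counter

# Hypothetical output of the similarity step: item -> its k most similar items.
similar_items = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D", "E"],
}
purchased = {"A", "B", "C"}   # items the active user already has

def item_based_top_n(similar_items, purchased, n=2):
    counts = Counter()
    for item in purchased:                       # union of neighbors of purchased items
        for neighbor in similar_items.get(item, []):
            if neighbor not in purchased:        # drop already-purchased items
                counts[neighbor] += 1
    return [item for item, _ in counts.most_common(n)]

print(item_based_top_n(similar_items, purchased))  # -> ['D', 'E']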
3.4 Extension to Memory-Based Algorithms
3.4.1 Default Voting.
Pairwise similarity is computed only from the ratings in the intersection of the items both users have rated.
Focusing similarity on the intersection set neglects the global rating behavior reflected in a user's entire rating history, and it is unreliable when that intersection is small. Default voting inserts a default rating value for unrated items to extend the set over which similarity is computed.
3.4.2 Inverse User Frequency.
Universally liked items are not as useful in capturing similarity as less common items. Weight item j by fj = log(n/nj), where n is the total number of users and nj is the number of users who have rated item j.
3.4.3 Case Amplification
Reduces noise in the data by raising each similarity weight to a power ρ (typically ρ = 2.5): w' = w·|w|^(ρ−1). It tends to favor high weights, as small values raised to a power become negligible.
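A minimal sketch of inverse user frequency and case amplification (function names are my own; ρ = 2.5 is the commonly cited default).

import numpy as np

def inverse_user_frequency(n_users, n_rated_j):
    # f_j = log(n / n_j): the rarer the item, the larger its weight.
    return np.log(n_users / n_rated_j)

def case_amplify(w, rho=2.5):
    # w' = w · |w|^(rho − 1): keeps the sign, makes small weights negligible.
    return w * np.abs(w) ** (rho - 1)

print(inverse_user_frequency(1000, 900))    # near-universal item -> small weight
print(inverse_user_frequency(1000, 10))     # rare item -> large weight
print(case_amplify(np.array([0.9, 0.3])))   # 0.9 stays strong, 0.3 fades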
3.4.4 Imputation-Boosted CF.
Fill in the missing data with imputed values before applying a CF algorithm.
IBCF using naïve Bayes as the imputer performs especially well, outperforming the content-boosted CF algorithm (a representative hybrid CF), and does so without using external content information.
3.4.5 Weighted Majority Prediction.
Make predictions using the rows that have observed values in the same column, weighted by the believed similarity between the rows, with binary rating values.
4 Model-Based CF
4.1 Bayesian Belief Net CF
4.1.1 Simple Bayesian CF
For incomplete data, the probability calculation and classification are computed over observed data only.
The Laplace estimator is used to smooth the probability calculation and avoid a conditional probability of 0 (see the sketch below).
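A minimal sketch of simple Bayesian CF with the Laplace estimator: predict the active user's rating of a target item from class-conditional probabilities of the other ratings (toy data, ratings in {1, 2}; all names are my own).

import numpy as np

# Rows = training users, columns = items.
R = np.array([
    [1, 2, 1],
    [2, 2, 2],
    [1, 1, 1],
    [2, 1, 2],
])
target, classes = 2, [1, 2]    # predict the rating (class) of item 2

def laplace_prob(count, total, n_values):
    # Laplace estimator: (count + 1) / (total + n_values), never exactly 0.
    return (count + 1) / (total + n_values)

def predict(observed):
    # observed: {item_index: rating} for the active user.
    best, best_score = None, -np.inf
    for c in classes:
        rows = R[R[:, target] == c]
        score = np.log(laplace_prob(len(rows), len(R), len(classes)))   # class prior
        for item, rating in observed.items():
            match = np.sum(rows[:, item] == rating)
            score += np.log(laplace_prob(match, len(rows), len(classes)))
        if score > best_score:
            best, best_score = c, score
    return best

print(predict({0: 1, 1: 1}))   # hypothetical active user -> 1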
4.1.2 NB-ELR and TAN-ELR CF
ELR: Extended logistic regression
TAN-ELR: tree-augmented naïve Bayes optimized by ELR
NB-ELR: naïve Bayes optimized by ELR
4.1.3 Other Bayesian CF.
Bayesian belief nets with decision trees at each node.
4.2 Clustering CF.
A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
Similarity between objects: Minkowski distance, Pearson correlation
Partitioning method: k-means
Density-based method: DBSCAN
Hierarchical clustering method: BIRCH
Clustering models have better scalability than typical CF.
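A minimal sketch of clustering-based neighbor search: cluster users with k-means, then look for neighbors only inside the active user's cluster (scikit-learn, toy data).

import numpy as np
from sklearn.cluster import KMeans

# Toy user-item matrix (0 = unrated); two clear taste groups.
R = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0, 4.0],
    [1.0, 0.0, 4.0, 5.0, 5.0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(R)

active = 0
same_cluster = np.where(kmeans.labels_ == kmeans.labels_[active])[0]
neighbors = same_cluster[same_cluster != active]   # candidates: same cluster only
print("search neighbors among users:", neighbors)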
4.3 Regression-Based CF
The rating matrix Y is sparse.
Sparse factor analysis replaces missing elements with default voting values and fits a linear factor model to the completed matrix.
4.4 MDP-based CF
MDPs: Markov decision processes, a model for sequential stochastic decision problems.
An MDP is a four-tuple (S, A, R, Pr):
S: the set of states
A: the set of actions (e.g., items to recommend)
R: the reward function, e.g., net profit from a recommendation
Pr: the state-transition probabilities
POMDPs: partially observable MDPs
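A minimal value-iteration sketch for a toy recommendation MDP (all numbers hypothetical): 2 states (the user's last purchase) and 2 actions (which item to recommend next).

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# Pr[s, a, s']: transition probabilities; R[s, a]: expected reward (net profit).
Pr = np.array([[[0.8, 0.2], [0.3, 0.7]],
               [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.5],
              [0.2, 2.0]])

V = np.zeros(n_states)
for _ in range(100):            # Bellman optimality update
    Q = R + gamma * (Pr @ V)    # Q[s, a] = R[s, a] + gamma * E[V(s')]
    V = Q.max(axis=1)

print("optimal recommendation per state:", Q.argmax(axis=1))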
4.5 Latent Semantic CF
Latent class variables in a mixture model setting are used to discover user communities and prototypical interest profiles.
Higher accuracy and scalability.
An aspect model is a probabilistic latent-space model, which models individual ratings as a convex combination of rating factors.
Latent Dirichlet Allocation (LDA): a generative probabilistic model in which each item is modeled as a finite mixture over an underlying set of users (see the sketch below).
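A minimal sketch applying LDA to user-item interaction counts with scikit-learn; note it is transposed from the note above, modeling each user as a finite mixture over latent communities (toy data).

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy interaction counts, rows = users, columns = items.
X = np.array([
    [3, 0, 2, 0],
    [2, 1, 3, 0],
    [0, 4, 0, 3],
    [0, 3, 1, 4],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
user_communities = lda.fit_transform(X)     # user -> community proportions
print(np.round(user_communities, 2))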
4.6 Other Model-Based CF
Two-stage order learning CF
Association rule based CF
Maximum entropy approach
Dependency network
Multiple multiplicative factor models (MMFs)
Probabilistic principal components analysis (pPCA)
5 Hybrid CF
Content-based techniques have the start-up problem.
Demographic-based recommender systems
Utility-based recommender systems
Knowledge-based recommender systems
5.1 Hybrid Recommenders Incorporating CF and Content-Based Features.
Content-boosted CF
Bayesian preference model
5.2 Hybrid Recommenders Combining CF and Other Recommender Systems.
Weighted hybrid recommender (see the sketch after this list)
Weighted majority voting
Weighted average voting
Switching hybrid recommender
Mixed hybrid recommenders
Cascade hybrid recommenders
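A minimal sketch of a weighted hybrid recommender, as referenced in the list above (weights and scores hypothetical).

def weighted_hybrid(cf_score, content_score, cf_weight=0.6):
    # P = w · P_cf + (1 − w) · P_content; the weight can be tuned per user.
    return cf_weight * cf_score + (1 - cf_weight) * content_score

print(weighted_hybrid(4.2, 3.0))   # 0.6 * 4.2 + 0.4 * 3.0 = 3.72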
5.3 Hybrid Recommenders Combining Memory-Based and Model-Based CF.
PMCF: Probabilistic memory-based collaborative filtering
PD: Personality diagnosis
6 Evaluation Metrics
6.1 Mean Absolute Error (MAE) & Normalized MAE. MAE = (1/n)·Σ |p_i − r_i| over n tested predictions; NMAE divides MAE by the rating range (r_max − r_min) to make results comparable across rating scales.
6.2 RMSE = sqrt( (1/n)·Σ (p_i − r_i)² ), which penalizes large errors more heavily than MAE.
6.3 ROC sensitivity measures how well the recommender separates items the user likes from items it should not recommend.
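Minimal sketches of MAE, NMAE, and RMSE over predicted vs. true ratings (toy numbers; a 1-5 rating scale is assumed for NMAE).

import numpy as np

def mae(pred, true):
    # Mean absolute error over n predictions.
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.abs(pred - true).mean()

def nmae(pred, true, r_min=1.0, r_max=5.0):
    # MAE normalized by the rating range, comparable across rating scales.
    return mae(pred, true) / (r_max - r_min)

def rmse(pred, true):
    # Root mean squared error; penalizes large errors more than MAE.
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.sqrt(((pred - true) ** 2).mean())

p, r = [4.0, 3.5, 2.0], [5.0, 3.0, 2.0]
print(mae(p, r), nmae(p, r), rmse(p, r))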
7 Conclusions