1 Introduction
Challenge:
To deal with highly sparse data
To scale with the increasing numbers of users and items
To make satisfactory recommendations in a short time period
To deal with problems like synonymy, i.e., the same or similar items having different names
To deal with shilling attacks, data noise, and privacy-protection concerns
Methods:
Memory-based CF methods – compute similarity or weights between users or items, and make predictions or recommendations based on the calculated similarity values.
Model-based CF methods – use only rating data to estimate or learn a model to make predictions: Bayesian belief net CF, clustering CF models, latent semantic CF models, Markov decision process (MDP)-based CF.
Content-based filtering – analyzes the content of textual information and finds regularities in the content by using features of users and items.
Hybrid CF – combines CF and content-based techniques, e.g., content-boosted CF.
2 Characteristics and Challenges of CF
2.1 Data Sparsity.
The cold start problem (new user problem or new item problem) occurs when a new user or item has just entered the system. It is difficult to find similar ones because there is not enough information. New items cannot be recommended until some users rate them, and new users are unlikely to be given good recommendations because of the lack of rating or purchase history.
To alleviate the data sparsity problem, many approaches have been proposed:
Singular Value Decomposition (SVD) removes unrepresentative or insignificant users or items to reduce the dimensionality of the user-item matrix directly. Latent Semantic Indexing (LSI) uses SVD to calculate similarity between users from their representation in the reduced space (see the sketch after this list).
Principal Component Analysis (PCA), a closely related factor-analysis technique, also reduces dimensionality.
Model-based CF to address the sparsity problem
Naïve Bayes optimized by extended logistic regression provides more accurate predictions for sparse data.
Association retrieval technique applies an associative retrieval framework to explore transitive associations among users through their rating and purchase history.
Maximum margin matrix factorizations, Multiple imputation-based CF, Imputation-boosted CF
Hybrid CF to address the sparsity problem
Aspect model latent variable method for cold start combines both collaborative and content information in model fitting.
A probabilistic model to address the cold start problem classifies items into groups; predictions are made for users considering the Gaussian distribution of user ratings.
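A minimal sketch of the SVD/LSI idea above: factor a toy user-item matrix with scipy's truncated SVD and compare users in the reduced space (all data and helper names are illustrative assumptions).

import numpy as np
from scipy.sparse.linalg import svds

# Toy user-item rating matrix (0 = missing); values are illustrative only.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 0.0, 5.0, 4.0],
])

k = 2                      # keep the k strongest latent dimensions
U, s, Vt = svds(R, k=k)    # truncated SVD of the user-item matrix

user_latent = U * s        # each user represented in the reduced space

def cosine(a, b):
    # Cosine similarity between two latent user vectors, as in LSI.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"sim(user0, user1) = {cosine(user_latent[0], user_latent[1]):.3f}")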
2.2 Scalability.
The numbers of existing users and items grow tremendously, pushing computational resources beyond practical or acceptable levels.
SVD-based methods deal with the scalability problem and quickly produce good-quality recommendations, but they have to undergo expensive matrix-factorization steps.
Memory-based CF
Item-based Pearson CF addresses scalability by calculating similarities between only pairs of co-rated items instead of all pairs (a sketch follows this list).
Model-based CF
Clustering CF addresses the scalability problem by seeking users for recommendation within smaller and highly similar clusters instead of the entire database.
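A minimal sketch of item-based Pearson similarity restricted to co-rated pairs, as in the memory-based bullet above (toy data; the helper name is my own).

import numpy as np

# Toy ratings matrix, rows = users, columns = items, 0 = unrated.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 4.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

def item_pearson(R, i, j):
    # Only users who rated both items contribute to the similarity.
    co = (R[:, i] > 0) & (R[:, j] > 0)
    if co.sum() < 2:
        return 0.0                # too little overlap to compare
    x, y = R[co, i], R[co, j]
    x, y = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

print(item_pearson(R, 0, 3))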
2.3 Synonymy.
It refers to the tendency of a number of the same or very similar items to have different names or entries.
Singular Value Decomposition (SVD) constructs a semantic space where terms and documents that are closely associated are placed close to each other.
2.4 Gray Sheep.
It refers to the users whose opinions do not consistently agree or disagree with any group of people and thus do not benefit from CF.
A hybrid approach can combine content-based and CF recommendations by basing a prediction on a weighted average of the content-based prediction and the CF prediction.
2.5 Shilling Attacks.
People may give tons of positive recommendations for their own materials and negative recommendations for their competitors.
Item-based CF was found to be less affected by shilling attacks than user-based CF.
Remove global effects in the data-normalization stage of neighbor-based CF, and work with the residuals of global effects to select neighbors.
2.6 Others
1 Protecting users' privacy
2 Increased noise / sabotage
3 Explainability
2.7 The Netflix Prize Challenge.
The 2009 winning solution merged latent factor and neighborhood models.
3 Memory-Based Collaborative Filtering Techniques
To generate a top-N recommendation, we need to find the k most similar users or items (nearest neighbors) after computing the similarities, then aggregate the neighbors to get the top-N most frequent items as the recommendation.
3.1 Similarity Computation.
3.1.1 Correlation-Based Similarity
For the user-based algorithm, the Pearson correlation between users u and v is
w(u,v) = Σ_i (r_ui − r̄_u)(r_vi − r̄_v) / ( sqrt(Σ_i (r_ui − r̄_u)²) · sqrt(Σ_i (r_vi − r̄_v)²) ),
where i ranges over the set I of items co-rated by u and v, and r̄_u is u's average rating.
For the item-based algorithm, w(i,j) is computed analogously over the users who rated both items i and j.
Spearman rank correlation: the same computation applied to rating ranks instead of raw rating values.
3.1.2 Vector Cosine-Based Similarity
Vector cosine similarity: w(i,j) = (Ri · Rj) / (‖Ri‖ · ‖Rj‖), where Ri and Rj are the rating vectors of items i and j.
Adjusted cosine similarity has the same form as the Pearson correlation, except that each rating is offset by the corresponding user's average rating (rather than the item's) before the cosine is taken.
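A minimal sketch of vector cosine and adjusted cosine similarity between two item columns (toy data; helper names are my own).

import numpy as np

# Toy ratings matrix, rows = users, columns = items (fully rated for simplicity).
R = np.array([
    [5.0, 3.0, 4.0],
    [4.0, 2.0, 5.0],
    [1.0, 5.0, 2.0],
])

def vector_cosine(R, i, j):
    # Plain cosine of the angle between the two item rating vectors.
    x, y = R[:, i], R[:, j]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def adjusted_cosine(R, i, j):
    # Subtract each user's mean rating first, then take the cosine.
    C = R - R.mean(axis=1, keepdims=True)
    x, y = C[:, i], C[:, j]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

print(vector_cosine(R, 0, 2), adjusted_cosine(R, 0, 2))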
3.2 Prediction and Recommendation Computation
3.2.1 Weighted Sum of Others' Ratings.
Take a weighted average of all the ratings on the item across users, offsetting each rating by its user's mean (see the sketch after 3.2.2):
P(a,i) = r̄_a + Σ_u (r_ui − r̄_u)·w(a,u) / Σ_u |w(a,u)|, where u ranges over the users who rated item i.
3.2.2 Simple Weighted Average.
Take a weighted average of the user's ratings across items:
P(u,i) = Σ_j r_uj·w(i,j) / Σ_j |w(i,j)|, where j ranges over the items rated by u.
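A minimal sketch of the weighted-sum prediction formula above (all numbers hypothetical; similarities assumed precomputed, e.g., by Pearson correlation).

import numpy as np

def weighted_sum_prediction(active_mean, neighbor_means, neighbor_ratings, sims):
    # P(a,i) = r̄_a + sum_u (r_ui − r̄_u)·w(a,u) / sum_u |w(a,u)|
    sims = np.asarray(sims, dtype=float)
    offsets = np.asarray(neighbor_ratings, dtype=float) - np.asarray(neighbor_means, dtype=float)
    denom = np.abs(sims).sum()
    return active_mean + (sims @ offsets) / denom if denom else active_mean

# Active user's mean is 3.5; two neighbors rated the target item.
print(weighted_sum_prediction(3.5, [3.0, 4.0], [4.0, 5.0], [0.8, 0.6]))  # -> 4.5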
3.3 Top-N Recommendations.
3.3.1 User-Based Top-N Recommendation Algorithms.
Look for k most similar users.
Aggregate to identify set of items based on frequency within users.
Recommend the top-N most frequent items that the active user has not purchased yet.
3.3.2 Item-Based Top-N Recommendation Algorithms.
First, look for the k most similar items of each item the user has purchased.
Then identify the candidate set of recommended items by taking the union of those k most similar items.
Finally, remove each item in the set that the user has already purchased.
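A minimal sketch of the item-based top-N procedure above; here candidates are ranked simply by how often they appear as neighbors, whereas the full algorithm sorts them by similarity to the user's basket (toy data).

from collections import Counter

# Hypothetical output of the similarity step: item -> its k most similar items.
similar_items = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D", "E"],
}
purchased = {"A", "B", "C"}   # items the active user already has

def item_based_top_n(similar_items, purchased, n=2):
    counts = Counter()
    for item in purchased:                       # union of neighbors of purchased items
        for neighbor in similar_items.get(item, []):
            if neighbor not in purchased:        # drop already-purchased items
                counts[neighbor] += 1
    return [item for item, _ in counts.most_common(n)]

print(item_based_top_n(similar_items, purchased))  # -> ['D', 'E']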
3.4 Extension to Memory-Based Algorithms
3.4.1 Default Voting.
Pairwise similarity is computed only from the ratings in the intersection of the items both users have rated.
Focusing similarity on the intersection set neglects the global rating behavior reflected in a user's entire rating history, and it is unreliable when that intersection is small. Default voting inserts a default rating value for unrated items to extend the set over which similarity is computed.
3.4.2 Inverse User Frequency.
Universally liked items are not as useful in capturing similarity as less common items. Weight item j by fj = log(n/nj), where n is the total number of users and nj is the number of users who have rated item j.
3.4.3 Case Amplification
Reduces noise in the data by raising each similarity weight to a power ρ (typically ρ = 2.5): w' = w·|w|^(ρ−1). It tends to favor high weights, as small values raised to a power become negligible.
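A minimal sketch of inverse user frequency and case amplification (function names are my own; ρ = 2.5 is the commonly cited default).

import numpy as np

def inverse_user_frequency(n_users, n_rated_j):
    # f_j = log(n / n_j): the rarer the item, the larger its weight.
    return np.log(n_users / n_rated_j)

def case_amplify(w, rho=2.5):
    # w' = w · |w|^(rho − 1): keeps the sign, makes small weights negligible.
    return w * np.abs(w) ** (rho - 1)

print(inverse_user_frequency(1000, 900))    # near-universal item -> small weight
print(inverse_user_frequency(1000, 10))     # rare item -> large weight
print(case_amplify(np.array([0.9, 0.3])))   # 0.9 stays strong, 0.3 fades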
3.4.4 Imputation-Boosted CF.
Fill in the missing data with imputed values before applying a CF algorithm.
IBCF using naïve Bayes as the imputer performs especially well, outperforming the content-boosted CF algorithm (a representative hybrid CF), and does so without using external content information.
3.4.5 Weighted Majority Prediction.
Make predictions using the rows that have observed values in the same column, weighted by the believed similarity between the rows, with binary rating values.
4 Model-Based CF
4.1 Bayesian Belief Net CF
4.1.1 Simple Bayesian CF
For incomplete data, the probability calculation and classification are computed over observed data only.
The Laplace estimator is used to smooth the probability calculation and avoid a conditional probability of 0 (see the sketch below).
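A minimal sketch of simple Bayesian CF with the Laplace estimator: predict the active user's rating of a target item from class-conditional probabilities of the other ratings (toy data, ratings in {1, 2}; all names are my own).

import numpy as np

# Rows = training users, columns = items.
R = np.array([
    [1, 2, 1],
    [2, 2, 2],
    [1, 1, 1],
    [2, 1, 2],
])
target, classes = 2, [1, 2]    # predict the rating (class) of item 2

def laplace_prob(count, total, n_values):
    # Laplace estimator: (count + 1) / (total + n_values), never exactly 0.
    return (count + 1) / (total + n_values)

def predict(observed):
    # observed: {item_index: rating} for the active user.
    best, best_score = None, -np.inf
    for c in classes:
        rows = R[R[:, target] == c]
        score = np.log(laplace_prob(len(rows), len(R), len(classes)))   # class prior
        for item, rating in observed.items():
            match = np.sum(rows[:, item] == rating)
            score += np.log(laplace_prob(match, len(rows), len(classes)))
        if score > best_score:
            best, best_score = c, score
    return best

print(predict({0: 1, 1: 1}))   # hypothetical active user -> 1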
4.1.2 NB-ELR and TAN-ELR CF
ELR: Extended logistic regression
TAN-ELR: tree-augmented naïve Bayes optimized by ELR
NB-ELR: naïve Bayes optimized by ELR
4.1.3 Other Bayesian CF.
Bayesian belief nets with decision trees at each node.
4.2 Clustering CF.
A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
Similarity between objects: Minkowski distance, Pearson correlation
Partitioning method: k-means
Density-based method: DBSCAN
Hierarchical clustering method: BIRCH
Clustering models have better scalability than typical CF.
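A minimal sketch of clustering-based neighbor search: cluster users with k-means, then look for neighbors only inside the active user's cluster (scikit-learn, toy data).

import numpy as np
from sklearn.cluster import KMeans

# Toy user-item matrix (0 = unrated); two clear taste groups.
R = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0, 4.0],
    [1.0, 0.0, 4.0, 5.0, 5.0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(R)

active = 0
same_cluster = np.where(kmeans.labels_ == kmeans.labels_[active])[0]
neighbors = same_cluster[same_cluster != active]   # candidates: same cluster only
print("search neighbors among users:", neighbors)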
4.3 Regression-Based CF
The rating matrix Y is sparse.
Sparse factor analysis replaces missing elements with default voting values and fits a linear factor model to the completed matrix.
4.4 MDP-based CF
MDPs: Markov decision processes, a model for sequential stochastic decision problems.
An MDP is a four-tuple (S, A, R, Pr):
S: the set of states
A: the set of actions (e.g., items to recommend)
R: the reward function, e.g., net profit from a recommendation
Pr: the state-transition probabilities
POMDPs: partially observable MDPs
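A minimal value-iteration sketch for a toy recommendation MDP (all numbers hypothetical): 2 states (the user's last purchase) and 2 actions (which item to recommend next).

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# Pr[s, a, s']: transition probabilities; R[s, a]: expected reward (net profit).
Pr = np.array([[[0.8, 0.2], [0.3, 0.7]],
               [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.5],
              [0.2, 2.0]])

V = np.zeros(n_states)
for _ in range(100):            # Bellman optimality update
    Q = R + gamma * (Pr @ V)    # Q[s, a] = R[s, a] + gamma * E[V(s')]
    V = Q.max(axis=1)

print("optimal recommendation per state:", Q.argmax(axis=1))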
4.5 Latent Semantic CF
Latent class variables in a mixture model setting are used to discover user communities and prototypical interest profiles.
Higher accuracy and scalability.
An aspect model is a probabilistic latent-space model, which models individual ratings as a convex combination of rating factors.
Latent Dirichlet Allocation (LDA): a generative probabilistic model in which each item is modeled as a finite mixture over an underlying set of users (see the sketch below).
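A minimal sketch applying LDA to user-item interaction counts with scikit-learn; note it is transposed from the note above, modeling each user as a finite mixture over latent communities (toy data).

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy interaction counts, rows = users, columns = items.
X = np.array([
    [3, 0, 2, 0],
    [2, 1, 3, 0],
    [0, 4, 0, 3],
    [0, 3, 1, 4],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
user_communities = lda.fit_transform(X)     # user -> community proportions
print(np.round(user_communities, 2))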
4.6 Other Model-Based CF
Two-stage order learning CF
Association rule based CF
Maximum entropy approach
Dependency network
Multiple multiplicative factor models (MMFs)
Probabilistic principal components analysis (pPCA)
5 Hybrid CF
Content-based techniques have the start-up problem.
Demographic-based recommender systems
Utility-based recommender systems
Knowledge-based recommender systems
5.1 Hybrid Recommenders Incorporating CF and Content-Based Features.
Content-boosted CF
Bayesian preference model
5.2 Hybrid Recommenders Combining CF and Other Recommender Systems.
Weighted hybrid recommender (see the sketch after this list)
Weighted majority voting
Weighted average voting
Switching hybrid recommender
Mixed hybrid recommenders
Cascade hybrid recommenders
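A minimal sketch of a weighted hybrid recommender, as referenced in the list above (weights and scores hypothetical).

def weighted_hybrid(cf_score, content_score, cf_weight=0.6):
    # P = w · P_cf + (1 − w) · P_content; the weight can be tuned per user.
    return cf_weight * cf_score + (1 - cf_weight) * content_score

print(weighted_hybrid(4.2, 3.0))   # 0.6 * 4.2 + 0.4 * 3.0 = 3.72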
5.3 Hybrid Recommenders Combining Memory-Based and Model-Based CF.
PMCF: Probabilistic memory-based collaborative filtering
PD: Personality diagnosis
6 Evaluation Metrics
6.1 Mean Absolute Error (MAE) & Normalized MAE. MAE = (1/n)·Σ |p_i − r_i| over n tested predictions; NMAE divides MAE by the rating range (r_max − r_min) to make results comparable across rating scales.
6.2 RMSE = sqrt( (1/n)·Σ (p_i − r_i)² ), which penalizes large errors more heavily than MAE.
6.3 ROC sensitivity measures how well the recommender separates items the user likes from items it should not recommend.
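Minimal sketches of MAE, NMAE, and RMSE over predicted vs. true ratings (toy numbers; a 1-5 rating scale is assumed for NMAE).

import numpy as np

def mae(pred, true):
    # Mean absolute error over n predictions.
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.abs(pred - true).mean()

def nmae(pred, true, r_min=1.0, r_max=5.0):
    # MAE normalized by the rating range, comparable across rating scales.
    return mae(pred, true) / (r_max - r_min)

def rmse(pred, true):
    # Root mean squared error; penalizes large errors more than MAE.
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.sqrt(((pred - true) ** 2).mean())

p, r = [4.0, 3.5, 2.0], [5.0, 3.0, 2.0]
print(mae(p, r), nmae(p, r), rmse(p, r))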
7 Conclusions