Wednesday, October 28, 2015

Part 1: Arules in R

Association Rules for Market Basket Analysis

We need a strategy for extracting insights from transaction and co-occurrence data sets; association rule mining is one such strategy. Association rule analysis attempts to find sets of informative patterns in large, sparse data sets.

The idea of association rule mining is this: when events occur together more often than one would expect from their individual rates of occurrence, that co-occurrence is an interesting pattern.

An association is simply the co-occurrence of two or more things.

The support for a set of items is the proportion of all transactions that contain the set.

The confidence is the support for the co-occurrence of all items in a rule, conditional on the support for the left-hand set alone: confidence(X => Y) = support(X and Y) / support(X).

The lift is the support of a set relative to the product of the supports of its elements: lift(X, Y) = support(X and Y) / (support(X) * support(Y)). A lift of 50, for example, indicates the combination occurs 50 times more often than we would expect if the two items were independent.
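To make these definitions concrete, here is a minimal base-R sketch with made-up items that computes support, confidence, and lift for the rule {milk} => {bread}:

# Toy baskets (hypothetical items, for illustration only)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "butter"),
  c("milk", "bread")
)

# support(X) = proportion of transactions containing X
supp <- function(items) mean(sapply(baskets, function(b) all(items %in% b)))

supp_xy <- supp(c("milk", "bread"))                   # support of {milk, bread}
conf    <- supp_xy / supp("milk")                     # confidence of {milk} => {bread}
lift    <- supp_xy / (supp("milk") * supp("bread"))   # lift of the rule

round(c(support = supp_xy, confidence = conf, lift = lift), 3)
# support 0.6, confidence 0.75, lift 0.938: milk and bread co-occur slightly
# less often than independence would predict (lift < 1).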

#Package arules  provides both data structures for efficient handling of sparse binary data as well as interfaces to implementations of Apriori and Eclat for mining frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules. 
require(arules)

# example 1: creating transactions from a list
a_list <- list(
c("a","b","c"),
c("a","b"),
c("a","b","d"),
c("c","e"),
c("a","b","d","e")
)

names(a_list) <- paste("Tr",c(1:5), sep = "")
a_list

trans <- as(a_list, "transactions")
summary(trans)
image(trans)


## example 2: creating transactions from a matrix
a_matrix <- matrix(
c(1,1,1,0,0,
1,1,0,0,0,
1,1,0,1,0,
0,0,1,0,1,
  1,1,0,1,1), ncol = 5)

dimnames(a_matrix) <- list(
c("a","b","c","d","e"),
paste("Tr",c(1:5), sep = ""))
a_matrix
trans2 <- as(a_matrix, "transactions")
trans2
inspect(trans2)

## example 3: creating transactions from data.frame
a_df <- data.frame(
age = as.factor(c(6,8,7,6,9,5)),
grade = as.factor(c(1,3,1,1,4,1)))
## note: all attributes have to be factors
a_df
## coerce
trans3 <- as(a_df, "transactions")
image(trans3)


#5.1. Example 1: Analyzing and preparing a transaction data set;
data("Epub")
summary(Epub)
year <- strftime(as.POSIXlt(transactionInfo(Epub)[["TimeStamp"]]), "%Y")
table(year)
Epub2003 <- Epub[year == "2003"]
length(Epub2003)
image(Epub2003)
transactionInfo(Epub2003[size(Epub2003) > 20])
inspect(Epub2003[1:5])
as(Epub2003[1:5], "list")

EpubTidLists <- as(Epub, "tidLists")
EpubTidLists
as(EpubTidLists[1:3], "list")

#5.2. Example 2: Preparing and mining a questionnaire data set;
data("AdultUCI")
dim(AdultUCI)
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL

AdultUCI[[ "age"]] <- ordered(cut(AdultUCI[[ "age"]], c(15,25,45,65,100)),labels = c("Young""Middle-aged""Senior""Old"))
AdultUCI[[ "hours-per-week"]] <- ordered(cut(AdultUCI[[ "hours-per-week"]],c(0,25,40,60,168)),labels = c("Part-time""Full-time""Over-time""Workaholic"))
AdultUCI[[ "capital-gain"]] <- ordered(cut(AdultUCI[[ "capital-gain"]],c(-Inf,0,median(AdultUCI[[ "capital-gain"]][AdultUCI[[ "capital-gain"]]>0]),Inf)),labels = c("None""Low""High"))
AdultUCI[[ "capital-loss"]] <- ordered(cut(AdultUCI[[ "capital-loss"]],c(-Inf,0,median(AdultUCI[[ "capital-loss"]][AdultUCI[[ "capital-loss"]]>0]),Inf)),labels = c("none""low""high"))

Adult <- as(AdultUCI, "transactions")
Adult
summary(Adult)
itemFrequencyPlot(Adult, support = 0.1, cex.names = 0.8)
rules <- apriori(Adult, parameter = list(support = 0.01, confidence = 0.6))
rules
summary(rules)
rulesIncomeSmall <- subset(rules, subset = rhs %in% "income=small" & lift > 1.2)
rulesIncomeLarge <- subset(rules, subset = rhs %in% "income=large" & lift > 1.2)
inspect(head(sort(rulesIncomeSmall, by = "confidence"), n = 3))
inspect(head(sort(rulesIncomeLarge, by = "confidence"), n = 3))
write(rulesIncomeSmall, file = "data.csv", sep = ",", col.names = NA)

#5.3. Example 3: Extending arules with a new interest measure;
data("Adult")
fsets <- eclat(Adult, parameter = list(support = 0.05), control = list(verbose = FALSE))
singleItems <- fsets[size(items(fsets)) == 1]
singleSupport <- quality(singleItems)$support
names(singleSupport) <- unlist(LIST(items(singleItems), decode = FALSE))
head(singleSupport, n = 5)

itemsetList <- LIST(items(fsets), decode = FALSE)
allConfidence <- quality(fsets)$support / sapply(itemsetList, function(x) max(singleSupport[as.character(x)]))
quality(fsets) <- cbind(quality(fsets), allConfidence)

fsetsEducation <- subset(fsets, subset = items %pin% "education")
inspect(sort(fsetsEducation[size(fsetsEducation) > 1], by = "allConfidence")[1:3])

#5.4. Example 4: Sampling
data("Adult")
Adult

supp <- 0.05
epsilon <- 0.1
c <- 0.1
n <- -2 * log(c)/ (supp * epsilon^2)
n

AdultSample <- sample(Adult, n, replace = TRUE)
itemFrequencyPlot(AdultSample, population = Adult, support = supp, cex.names = 0.7)
itemFrequencyPlot(AdultSample, population = Adult, support = supp, lift = TRUE, cex.names = 0.9)
time <- system.time(itemsets <- eclat(Adult, parameter = list(support = supp), control = list(verbose = FALSE)))


## Application;
install.packages("chron")
install.packages("ggplot2")
install.packages("plyr")
install.packages("reshape2")
install.packages("arules")
install.packages("arulesViz")

#Analytical packages;
require(chron)
require(ggplot2)
require(plyr)
require(reshape2)
require(arules)
require(arulesViz)

#Association Rules;
load(file = "result5.rda")
result4$user_cumcnt <- as.integer(result4$user_cumcnt)
result4$user_cnt <- as.integer(result4$user_cnt)

#dropped robot;
#one user one basket without duplicate categories;
basic <- data.frame(result4[result4$user_cnt<=60, c(1,15)])
#dedupe due to merge of conv. category
basic <- unique(basic)
visit_cmatrix <- dcast(basic, userid ~ category)
cmatrix <- visit_cmatrix[,-1]

# recode each category column: 1 if the user visited the category, NA otherwise,
# then convert to factor so the data.frame can be coerced to transactions
for (i in seq_len(ncol(cmatrix))) {
  cmatrix[, i] <- ifelse(is.na(cmatrix[, i]), NA, 1)
  cmatrix[, i] <- as.factor(cmatrix[, i])
}

tmp <- as(cmatrix, "transactions")
tmp
summary(tmp)
inspect(tmp[1:10])
itemFrequencyPlot(tmp, support = 0.0001, cex.names=0.9)
#junk<-sort(itemFrequency(tmp))
#txt <- rep("",length(junk))
#for (i in 1:length(junk)) txt[i] <- strsplit(names(junk[i]), split="=")[[1]][1]
#plot1 <- barplot(junk, names = txt, col='blue', cex.names=0.5)
#text(plot1, (1:length(junk)) + par("cxy")[2], junk, cex = .8)


rules <- apriori(tmp, parameter = list(support = 0.001, confidence = 0.005))
rules
summary(rules)
#support --> P(A&B)
summary(quality(rules)$support)
#confidence --> P(A&B)/P(A)
summary(quality(rules)$confidence)
#lift --> P(A&B)/P(A)P(B)
summary(quality(rules)$lift)

inspect(head(sort(rules, by = "confidence"), n = 10))
inspect(head(sort(rules, by = "lift"), n = 10))
plot(rules, method = NULL, measure = "support", shading = "lift", interactive = FALSE, data=tmp)

plot(rules, measure=c("support", "lift"), shading="confidence", interactive=TRUE, data = tmp)

#Scatter plot
plot(rules)
head(quality(rules))
plot(rules, measure=c("support", "lift"), shading="confidence")

#Matrix-based visualizations
subrules <- rules[quality(rules)$confidence > 0.5]
plot(subrules, method="matrix", measure="lift")
plot(subrules, method="matrix", measure="lift", control=list(reorder=TRUE))
plot(subrules, method="matrix3D", measure="lift")

plot(rules, method="grouped")
plot(rules, method="grouped", control=list(k=50))

rulesRev <- new("rules", lhs=lhs(subrules), rhs=rhs(subrules))
m <- match(subrules, rulesRev, nomatch=0)
result <- data.frame(rules = labels(subrules), quality(subrules))
result$rules <- as.character(result$rules)
lhs <- do.call(rbind,strsplit(result[,1], " => "))
result <- cbind(lhs,result[,2:4])

plot <- data.frame(result[,c(1,2,5)])
names(plot) <- c("start", "end", "weight")
require(igraph)  # graph.data.frame(), V(), E(), degree(), and plot.igraph() below come from igraph
g <- graph.data.frame(plot, directed = TRUE)
g
V(g)$name
E(g)$weight
ideg <- degree(g, mode = "in", loops = F)
#col=rainbow(12) # For edge colors

l <-layout.reingold.tilford(g)
l[,1] <- 0:15
l[,2] <- 0:15
plot.igraph(g,
            vertex.label = V(g)$name,
            vertex.label.color = "gray20",
            vertex.size = ideg*5 + 10,
            vertex.size2 = 10,
            edge.arrow.size=0.5,
            edge.width = 2,
            edge.curved = T,
            layout = l)

Tuesday, October 27, 2015

7 - Online Marketing Tracking Engagement

Cookies

Cookies are a simple technology that has been around since the early days of the web. They are small pieces of data that web servers place on a user’s browser and then retrieve at a later time for various uses. For example, YouTube sets a cookie on your browser to remember your user name and volume settings so you don’t have to input those every time. Cookies are privacy conscious by design, so that only the server domain that sets a cookie is able to retrieve it.

Ad servers (such as DoubleClick or Atlas) use cookies to set unique IDs so they can identify the same user across multiple touchpoints. When an ad server receives an ad display request from a user who does not have an existing cookie, the ad server assigns a new unique ID (a random alpha-numeric string such as 118D132F9423). On each subsequent request the cookie returns the same unique ID, thus allowing the ad server to know that it is the same user. Because all requests are recorded by the ad server, reports can be created that provide a record of all the touchpoints for each user.

Click Redirects

For media running outside of the ad server, such as paid search, one approach is to apply click redirects within the URL. These briefly send the user through the ad server to be cookied and then shoot them back out to their desired landing page. This happens in a fraction of a second and is barely perceptible by the user. Because that intermediate hop is on the ad server domain, it can log the data needed to record and read the cookie for the unique ID.

Pixel Tags

For media without a click event (such as navigating to a site directly), click redirects cannot be used. In those cases, the solution is to use pixel tags (also known as tags, 1x1 pixels, or web bugs). Pixel tags are typically single pixel, transparent GIF images that are added to a web page. Even though the pixel tag is virtually invisible, it is still served just like any other image you may see online. The trick is that the web page is served from the site’s domain while the image is served from the ad server’s domain. This allows the ad server to read and record the cookie with the unique ID and the extended information it needs to record.

Tag Containers

Finally, sometimes tag containers (also called container tags) must be used. Tag containers hold JavaScript code which contains one or more pixel or click tags. These tag containers can control the information being passed into images, and can decide whether or not to “fire” or show the image tag, which would trigger the ad server to record an impression with the unique ID. A common use for tag containers is to tag users from unpaid impressions such as organic search visits and direct traffic. The attribution process often uses JavaScript to record the referring URL or other information needed to identify and record otherwise untracked conversion paths.

As you can see, there are a host of technologies used to track user engagement for attribution measurement. Once all the touchpoints for a user are recorded by an ad server, attribution software can consume the log files that ad servers provide and do the work necessary to appropriately assign fractional credit to each touchpoint that contributed to a desired outcome.

Thursday, October 15, 2015

6 - Cookie Profile - Quantcast

Time Period under study

Typical visit frequency
Consider the case of a user who deletes cookies every day. If a particular site has a group of very addicted users who return frequently, even if only a small number of them delete cookies daily, the result will be a significantly inflated number of cookies relative to the number of actual people who visited the site. In such cases it is not unusual to see an average of two or more cookies for every user over the course of a month. Of course, most users only result in one cookie, but a small number generate many cookies.

Some sites tend to have mostly passers-by, i.e., visitors who only go to the site once over a given time period. Each passer-by generates a single cookie, so they do not inflate the cookie count relative to the number of actual visitors.

Passers-by =1
Regulars = 2-29
Addicts = 30+

“Reference Site”
These sites have a relatively low frequency of visit profile. 

“High-Profile Blog”
These sites have a relatively high frequency of visit profile. A small number of addicts generate the majority of site visits.

“Social Network”

These sites have heavy addict consumption. These networks are relatively small.

5 - A survey of CF Techniques

1 Introduction
Challenge:
To deal with highly sparse data
To scale with the increasing numbers of users and items
To make satisfactory recommendations in a short time period
To deal with problems like synonymy, i.e., the same or similar items having different names
To deal with shilling attacks, data noise, privacy protections
Methods:
Memory-based CF methods – compute similarity or weights between users or items and make predictions or recommendations based on the calculated similarity values.
Model-based CF methods – use only rating data to estimate or learn a model to make predictions. Examples: Bayesian belief net CF, clustering CF models, latent semantic CF models, Markov decision process based CF.
Content-based filtering – analyze the content of textual information and find regularities in the content by using features of users and items.
Hybrid CF – combine CF and content-based techniques, e.g., content-boosted CF.

2 Characteristics and Challenges of CF
Challenge:
2.1 Data Sparsity.
The cold start problem (new user problem or new item problem) occurs when a new user or item has just entered the system. It is difficult to find similar ones because there is not enough information. New items cannot be recommended until some users rate them, and new users are unlikely to be given good recommendations because of the lack of rating or purchase history.
To alleviate the data sparsity problem, many approaches have been proposed:
Singular Value Decomposition (SVD) removes unrepresentative or insignificant users or items to reduce the dimensionality of the user-item matrix directly. Latent Semantic Indexing (LSI) uses SVD to calculate similarity between users by the representation of the users in the reduced space.
Principal Component Analysis (PCA), a closely related factor analysis technique, also reduces dimensionality.

Model-based CF to address the sparsity problem
Naïve Bayes optimized by extended logistic regression provides more accurate predictions for sparse data.
Association retrieval technique applies an associative retrieval framework to explore transitive associations among users through their rating and purchase history.
Maximum margin matrix factorizations, Multiple imputation-based CF, Imputation-boosted CF 

Hybrid CF to address the sparsity problem
Aspect model latent variable method for cold start combines both collaborative and content information in model fitting.
Probabilistic model to address the cold start problem classifies items into groups and predictions are made for users considering the Gaussian distribution of user ratings.

2.2 Scalability.
The numbers of existing users and items grow tremendously, pushing computational resource requirements beyond practical or acceptable levels.
SVD-based methods deal with the scalability problem and quickly produce good quality recommendations, but they have to undergo expensive matrix factorization steps.

Memory-based CF 
Item-based Pearson CF addresses scalability by calculating similarities between only pairs of co-rated items instead of all pairs.

Model-based CF
Clustering CF addresses the scalability problem by seeking users for recommendation within smaller and highly similar clusters instead of the entire database.

2.3 Synonymy. 
It refers to the tendency for a number of the same or very similar items to have different names or entries.
Singular Value Decomposition (SVD) constructs a semantic space where terms and documents that are closely associated are placed close to each other.

2.4 Gray Sheep.
It refers to the users whose opinions do not consistently agree or disagree with any group of people and thus do not benefit from CF.
Hybrid approach can combine content-based and CF recommendations by basing a prediction on a weighted average of the content-based prediction and the CF prediction.

2.5 Shilling Attacks.
People may give tons of positive recommendations for their own materials and negative recommendations for their competitors.
Item-based CF was less affected by the attacks than the user-based CF.
Remove global effects in the data normalization stage of neighbor-based CF, and work with the residuals of global effects to select neighbors.

2.6 Others
1 Protect users’ privacy
2 Increased noise / sabotage
3 Explain-ability
2.7 The Netflix Prize Challenge.
2009 Merged Model of latent factor and neighborhood models.

3 Memory-Based Collaborative Filtering Techniques

To generate a top-N recommendation, we need to find the k most similar users or items (nearest neighbors) after computing the similarities, then aggregate the neighbors to get the top-N most frequent items as the recommendation.

3.1 Similarity Computation.
3.1.1 Correlation-Based Similarity
For the user-based algorithm, the weight is w_{u,v}
For the item-based algorithm, the weight is w_{i,j}
Pearson correlation
Spearman rank correlation

3.1.2 Vector Cosine-Based Similarity
Vector cosine similarity
Adjusted cosine similarity has the same formula as Pearson correlation
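A small sketch (made-up rating vectors) of the two similarity measures above, computed over the items two users have co-rated:

# Ratings by two users on the same five items (NA = not rated); values are made up.
u <- c(5, 3, NA, 4, 2)
v <- c(4, 3,  2, 5, 1)

co <- which(!is.na(u) & !is.na(v))   # co-rated items only

# Pearson correlation on co-rated items (correlation-based similarity w_{u,v})
pearson_sim <- cor(u[co], v[co])

# Vector cosine similarity on co-rated items
cosine_sim <- sum(u[co] * v[co]) / (sqrt(sum(u[co]^2)) * sqrt(sum(v[co]^2)))

c(pearson = pearson_sim, cosine = cosine_sim)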

3.2 Prediction and Recommendation Computation
3.2.1 Weighted Sum of Others' Ratings.
Take weighted average of all the ratings on the item across all users.

3.2.2 Simple Weighted Average
Take the weighted average of all the ratings by the user across all items.

3.3 Top-N Recommendations.
3.3.1 User-Based Top-N Recommendation Algorithms.
Look for k most similar users.
Aggregate to identify set of items based on frequency within users.
Recommend the top-N most frequent items that the active user has not purchased yet (see the sketch below).
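A compact sketch of this user-based top-N procedure on a made-up binary purchase matrix, using cosine similarity (all names and numbers are illustrative):

# Rows = users, columns = items; 1 = purchased, 0 = not (toy data)
m <- matrix(c(1,1,0,0,1,
              1,0,1,0,1,
              0,1,1,1,0,
              1,1,0,1,1), nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), paste0("item", 1:5)))

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

user_based_topN <- function(active, k = 2, N = 2) {
  others <- setdiff(rownames(m), active)
  sims   <- sapply(others, function(u) cosine(m[active, ], m[u, ]))
  nbrs   <- names(sort(sims, decreasing = TRUE))[1:k]   # k most similar users
  freq   <- colSums(m[nbrs, , drop = FALSE])            # aggregate neighbors' items
  freq   <- freq[m[active, ] == 0]                      # drop items already purchased
  names(sort(freq, decreasing = TRUE))[seq_len(min(N, length(freq)))]
}

user_based_topN("u1")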

3.3.2 Item-Based Top-N Recommendation Algorithms.
First, look for the k most similar items.
Then identify the candidate set of recommended items by taking the union of the k most similar items.
Finally, remove each of the items in the set that the user has already purchased.

3.4 Extension to Memory-Based Algorithms
3.4.1 Default Voting.
Pairwise similarity is computed only from the ratings in the intersection of the items both users have rated.
Focusing only on the intersection set neglects the global rating behavior reflected in a user's entire rating history, so the resulting similarity will not be reliable; default voting assigns a default value to items without explicit ratings.

3.4.2 Inverse User Frequency.
Universally liked items are not as useful in capturing similarity as less common items: f_j = log(n / n_j), where n is the total number of users and n_j is the number of users who have rated item j.
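A short sketch of the inverse user frequency weight on a toy binary rating matrix (made-up data); items rated by everyone get weight 0, rarer items get larger weights:

# Toy user-item matrix: 1 = rated/purchased, 0 = not
ratings <- matrix(c(1,1,1,0,
                    1,1,0,0,
                    1,1,1,1,
                    1,0,0,0,
                    1,1,1,0), nrow = 5, byrow = TRUE,
                  dimnames = list(paste0("u", 1:5), paste0("item", 1:4)))

n  <- nrow(ratings)          # total number of users
nj <- colSums(ratings > 0)   # number of users who rated item j

iuf <- log(n / nj)           # f_j = log(n / n_j); 0 for items everyone rated
round(iuf, 3)

# Ratings can then be scaled by iuf before computing similarities,
# down-weighting universally liked items.
weighted <- sweep(ratings, 2, iuf, "*")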

3.4.3 Case Amplification
Case amplification raises the weights to a power to reduce noise in the data; it tends to favor high weights, as small values raised to a power become negligible.

3.4.4 Imputation-Boosted CF.
Fill in the missing data by imputation.
IBCF using naïve Bayes performs especially well, outperforming the content-boosted CF algorithm (a representative hybrid CF), and does so without using external content information.

3.4.5 Weighted Majority Prediction.
Make predictions using the rows with observed values in the same column, weighted by the believed similarity between the rows, with binary rating values.

4 Model-Based CF
4.1 Bayesian Belief Net CF
4.1.1 Simple Bayesian CF
For incomplete data, the probability calculation and classification are computed over the observed data.
The Laplace estimator is used to smooth the probability calculation and avoid a conditional probability of 0.

4.1.2 NB-ELR and TAN-ELR CF
ELR: Extended logistic regression
TAN-ELR: tree-augmented naïve Bayes optimized by ELR
NB-ELR: naïve Bayes optimized by ELR

4.1.3 Other Bayesian CF.
Bayesian belief nets with decision trees at each node.

4.2 Clustering CF.
A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
Similarity between objects: Minkowski distance, Pearson correlation
Partitioning method: k-means
Density-based method: DBSCAN
Hierarchical clustering method: BIRCH

Clustering models have better scalability than typical CF.

4.3 Clustering CF
Y is sparse. 
Sparse factor analysis replaces missing elements with default voting values.

4.4 MDP-based CF
MDPs: Markov decision processes, a model for sequential stochastic decision problems.
(S, A, R, Pr)
S: state
A: action
R: reward function, i.e., net profit
Pr: transition probability

POMDPs: partial observable MDP

4.5 Latent Semantic CF
Latent class variables in a mixture model setting to discover user communities and prototypical interest profiles.
Higher accuracy and scalability.
An aspect model is a probabilistic latent-space model, which models individual ratings as a convex combination of rating factors.
Latent Dirichlet Allocation (LDA, a generative probabilistic model, in which each item is modeled as a finite mixture over an underlying set of users).

4.6 Other Model-Based CF 
Two-stage order learning CF
Association rule based CF
Maximum entropy approach
Dependency network
Multiple multiplicative factor models (MMFs)
Probabilistic principal components analysis (pPCA)

5 Hybrid CF
Content-based techniques have the start-up problem.
Demographic-based recommender systems
Utility-based recommender systems
Knowledge-based recommender systems

5.1 Hybrid Recommenders Incorporating CF and Content-Based Features.
Content-boosted CF
Baysian preference model

5.2 Hybrid Recommenders Combining CF and Other Recommender Systems.
Weighted hybrid recommender
Weighted majority voting
Weighted average voting
Switching hybrid recommender
Mixed hybrid recommenders
Cascade hybrid recommenders

5.3 Hybrid Recommenders Combining CF
PMCF: Probabilistic memory-based collaborative filtering
PD: Personality diagnosis

6 Evaluation Metrics
6.1 Mean Absolute Error(MAE) & Normalized MAE
6.2 RMSE
6.3 ROC sensitivity

7 Conclusions



4 - Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale

1 INTRODUCTION
Changes over time:
Ad format/ ad content (campaigns, creative, sites, etc.),
Impression delivery rates,
Narrowly / broadly audience targeting,
Geo targeting changes, etc.

Traditionally, online campaign effectiveness has been measured by clicks, on the reasoning that someone who interacts with an ad was affected by it.
Counting clicks alone captures only immediate responses and misses part of the value of a campaign.
Counting all visitors to the brand captures long-term interest but overstates campaign effectiveness.
A better metric is the change in online brand interest that can be attributed to the display ad campaign alone: user searches for brand terms or navigation to brand sites that can be attributed to the online ad campaign.

Propensity scoring and doubly robust estimation methods are used to protect against residual, hidden selection bias. The test statistic for an outcome of interest to the advertiser is judged against a null distribution based on a set of outcomes that are irrelevant to the advertiser.

2 Causal effects
Counterfactuals are what would have happened had the campaign not been run, which is fundamental to understanding the analysis of observational data found in the databases.
Y0: user outcome if not shown a campaign ad (control group)
Y1: user outcome if shown a campaign ad (exposed/test group)
Selection bias: more active web users are more likely to be exposed, and less active web users are more likely to be unexposed and hence potential controls, even if they satisfy the targeting conditions for the campaign.
In short, the ad serving process leads to selection bias or differences in the control and exposed that are related to the distribution of the outcomes (Y0,Y1).

3 The Controls
The first ad served to an exposed user splits that user's timeline into before-exposure and after-exposure periods. Controls and exposed users are compared using their activity before the first exposure to the campaign.
Advertiser’s own campaign information
Start dt, end dt, etc.
Ad serving logs
Ranking, frequency cap, etc.
Demo info & Geo Info
Country, language
Historical activity before
# of navigations
# of navigations to advertiser’s site before exposed
First exposure time
Number of campaign ads served to the user during the analysis period

4 Propensity Scores
If a control and exposed user were identical before exposure, then it is reasonable to assume that they would have had the same probability of showing interest in the brand after exposure if the campaign had not been run.
P(Y0 =1 |X, exposed) = P(Y0=1 | X, control)

4.1 Propensity Matching
P(X) = P(exposed | X)
Removes selection bias whenever matching on X itself removes selection bias.
Matching on a consistent estimate Phat can remove selection bias better than matching on X or P(X).
Controls and exposed with similar propensities can be grouped together and mean differences within groups averaged.

4.2 Inverse Propensity Weighting
There is selection bias because a feature is correlated with both sample selection and the outcome, but re-weighting the data with weights inversely proportional to the probability of selection removes the bias.
Control: Wi = Phat(Xi)/(1-Phat(Xi))
Exposed: Wi = 1
Weighting the exposed by 1/Phat(X) and the controls by 1/(1-Phat(X)) is appropriate when the goal is to estimate the potential campaign effect on all users.
Delta(IPW)
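A minimal R sketch of the inverse propensity weighting idea above, on simulated data; the confounder, model form, and effect sizes are all made up for illustration:

set.seed(1)
n <- 10000

# Simulated pre-exposure activity level (confounder): more active users are
# both more likely to be exposed and more likely to visit the brand site anyway.
activity <- rnorm(n)
exposed  <- rbinom(n, 1, plogis(-1 + 1.5 * activity))
y        <- rbinom(n, 1, plogis(-2 + 1.0 * activity + 0.3 * exposed))

# Naive difference in means is biased by selection
naive <- mean(y[exposed == 1]) - mean(y[exposed == 0])

# Propensity model: Phat(X) = P(exposed | X)
phat <- fitted(glm(exposed ~ activity, family = binomial))

# IPW weights for the effect on all users: exposed 1/Phat, controls 1/(1 - Phat)
w <- ifelse(exposed == 1, 1 / phat, 1 / (1 - phat))
delta_ipw <- weighted.mean(y[exposed == 1], w[exposed == 1]) -
             weighted.mean(y[exposed == 0], w[exposed == 0])

c(naive = naive, ipw = delta_ipw)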

5 Doubly Robust Estimation
The estimated Delta (DR) is doubly robust because it remains consistent if the outcome models are wrong but the propensity model is right or if the propensity model is wrong but the outcome models are right, although in those cases there is no guarantee that Delta(DR) has minimum asymptotic variance.

6 Hypothesis Testing
T(DR) = Delta(DR)/S(DR) approximately follows a standard normal N(0,1) distribution under the null hypothesis.
A fully non-parametric test is obtained by rejecting H0 only if T(DR) is larger than the quantile of the empirical null distribution.

7 Model Selections and Validation
Primary goal is to balance X across controls and exposed, in the sense that the distribution of X conditional on Phat(X,Z) should be independent of Z even for features of X that are not included in the fitted propensity model.
ANOVA F-test
GAM model
8 Experiments


9 Discussions

3 - Linkedin Talk: Claudia Perlich: Data Science and Optimization

Problem setup:
1 Ads exchange system: real time bidding
How much to bid? If win, what ads to show?
2 Observe view through conversion rate
Group content, reduce dim.
Quality control for incoming data.
Attribution question: even with independent effects, who gets credit if multiple targeters show ads?
Goal:
Evaluate every opportunity (a browser showing up on a piece of inventory, i.e., the URL the user is currently on, where an ad can be shown), and for a given campaign decide how much ($) to bid.
Procedure:
1 Depends on the deal type (CPM, CPA). Measure performance by CVR.
2 Data: 100M users, 100M URLs, 50K inventory buckets, 300 concurrent campaigns.

How to reduce dimensionality? 100M URLs with low base rates. Scoring must be real-time and scalable. How to sample?

0:15:10-User churn rate is high. Half of the people disappear within a day, which is similar to our results in the user churn analysis.

0:17:27 Three models:
Model 1: How to get signals from 100M URLs(Dim reduction)? 20 URLS per user on average.
Unconditional Brand Process. What is Prob(conv)? Conversion probability independent of ads. 

Take every single positive event. Augment with URL visits to get the negative set, b/c we can't sample by user.
Naïve Bayes, sparse logistic regression to get scores.
Adding new good data might cause problem. Timing is important. Facts with 30 days history and facts with 10 days history contain different volume and signals. Match by sampling. 

Model 2: Conditional on showing you ads, what is Prob(conv|imp)?
Frequency * scores as inputs with 20 dim variables.

Show ads for RON users randomly. Prevent from targeters (??exclude retargeting user sets).
Performance comparisons with RON.

Cluster URLs. Pre-filtering URLs, calculate correlations based on campaigns and scores. Work for small campaigns.
Model 3: Bidding.
First, nothing to ads, reinforce belief of the match.
Second, current intention.
Third, certain sites indicate that the cookie is likely to be deleted soon; ads shown to those users won't get credit. Low survival rate means a lower probability of showing ads.
Fourth, some websites are invisible.
Fifth, perceptiveness. The gorilla experiment. Focus on specific task, perceptions on other things completely out. http://www.theinvisiblegorilla.com/gorilla_experiment.html

0:47:48 real time bidding world use second price rule (highest bidder wins, pays second highest bid price). Theory suggests truthful bid your true value. 
Experience: Bid $2, b/c pay volume. No true value exists. AM decides, adjusted by baseline probability. 
$2 = $5 CPM × (1 − 60% margin).
Locally preference number by adjusting it by cvr, say, good reason to bid twice.
Strategy 1: price set by AM.
Strategy 2: price set × ratio, where the ratio is estimated as Pr(conv|segment, inventory)/Pr(conv|segment), which indicates the inventory value and separates good inventory from bad inventory. Economically efficient model: pay as much as the inventory is worth (see the sketch after this list).
Strategy 3: greedy bid. Only bid when it is good, but bid a lot.
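A small sketch of the Strategy 2 adjustment with made-up counts; the inventory names, impression and conversion numbers, and base price are all hypothetical:

base_price <- 2.00   # CPM set by the AM (illustrative)

# Hypothetical counts for one segment across three inventory buckets
inv <- data.frame(
  inventory   = c("news_site", "sports_site", "social"),
  impressions = c(50000, 30000, 120000),
  conversions = c(60, 25, 40)
)

p_conv_seg_inv <- inv$conversions / inv$impressions           # Pr(conv | segment, inventory)
p_conv_seg     <- sum(inv$conversions) / sum(inv$impressions) # Pr(conv | segment)

inv$ratio <- p_conv_seg_inv / p_conv_seg   # > 1 = good inventory, < 1 = bad inventory
inv$bid   <- base_price * inv$ratio        # Strategy 2: AM price scaled by the ratio
inv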

Conclusion: both strategy 2 and 3 got higher CVRs. Strategy 2 paid about the same price as Strategy 1: don't pay more, but look better. Strategy 3 has a higher CPA.

Site visits, clicks, purchases. Site visits almost always better than clicks. 70% of site visits better than purchases b/c bias-variance tradeoff.

Control trafficking by segment: Pr(conv|segment, inventory)/Pr(conv|segment).
0:55:19 The worst place to go for inventory is social media, e.g., Facebook.

Lift about 3-5X over RON.
An ROC plot shows sensitivity (% of correctly classified positive observations) versus specificity (% of correctly classified negative observations) as the output threshold is moved over the range of all possible values. In the ROC context, the area under the curve (AUC) measures the performance of a classifier and is frequently used for method comparison.
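A self-contained base-R sketch of an ROC curve and AUC on simulated scores and labels; the AUC is computed with the rank-based (Mann-Whitney) formulation, which equals the area under the empirical ROC curve:

set.seed(2)
labels <- rbinom(500, 1, 0.3)                          # 1 = positive (e.g., converter)
scores <- rnorm(500, mean = ifelse(labels == 1, 1, 0)) # higher scores for positives

# Sweep the threshold over all observed scores
thr <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(scores[labels == 1] >= t))  # sensitivity
fpr <- sapply(thr, function(t) mean(scores[labels == 0] >= t))  # 1 - specificity
plot(fpr, tpr, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)  # chance diagonal

# AUC via the Mann-Whitney rank statistic
r   <- rank(scores)
n1  <- sum(labels == 1); n0 <- sum(labels == 0)
auc <- (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc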



2 - Overcoming Browser Cookie Churn with Clustering

WSDM 2012

Group similar cookies into clusters that are more persistent than individual cookies. 

Such clustering could potentially allow more robust estimation of the number of unique visitors of the site over a certain long time period, and also better user modeling which is key to plenty of web applications such as advertising and recommender systems.

Cluster browser cookies into groups that are likely to belong to the same browser based on a statistical model of browser visitation patterns.

Generalized interval graph coloring problem and a greedy heuristic algorithm for solving it.

When a web browser visits a website for the first time, the website creates and assigns the browser a unique identifier (cookie), which is sent to the website on consecutive visits until it expires or is deleted by user through the browser interface or otherwise.

1 INTRODUCTION 

Metrics:
1 the number of unique visitors 
2 the average engagement time of a user
3 the number of repeat visitors to the site

Cookie-based unique visitor counting systems are widely implemented by many websites and usually overestimate the number of unique visitors by large factors. We refer to the problem of frequent cookie deletion as “cookie churn”.

Recently-created cookies have little history to learn from, aka, cold start problem, which negatively affects model performance. 

Estimate the cookie-to-visitor ratio, and then infer the number of unique visitors to the site based on the number of unique cookies and these ratios. Similar work tries to find the cookie-visitor or IP-visitor ratios and uses these to normalize the number of cookies.

Such ratio-based methods are applicable only to coarse grained user counting and not to the user behavioral modeling or the reach and frequency problems.

We aim to construct similar browser fingerprints, but based on features that are much more benign and less discriminative.

Approach:
User-modeling and reach-frequency estimation from ad campaign is to create clusters of cookies based on a behavioral profile - each cluster ideally will represent all and only the cookies associated with a single browser.
The lifetime of cookies is defined as time elapsed between cookie creation and the last page view.

Challenges:
First, the solution has to scale well to be able to process hundreds of millions of cookies created every month at major websites.
Second, since each cluster represents a browser, the number of clusters is unknown - in fact, one of the outputs of our method is an estimate of the number of underlying browsers. 
Third, for large websites that have millions of simultaneous visitors, the number of cannot-link constraints is quadratic in the number of cookies, stressing scalability even more.
Fourth, the amount of useful information in the profiles is low, making it hard to extract meaningful cookie profiles and to design accurate similarity measures among these profiles.

Solutions
Each cookie is an interval defined by its lifetime.
Partition the graph into the provably minimal number of groups of cookies where the lifetimes of all cookies in a group do not overlap.
Predict the likelihood of a cookie belonging to one of several candidate clusters. Cookie profile features include cookie lifetime, number of page views, OS type, distribution of IP addresses, categories of visited webpages, etc. Profile similarity measures include L2 distance and embedding Bayes factors learned through logistic regression.

Contribution
Formalize the problem of dealing with cookie churn by creating clusters of cookies using a constrained clustering framework.
Define a simple browser model for cookie lifetimes and use real-world data to validate our assumptions.
Define a set of cookie profile features, and propose an efficient and scalable clustering algorithm based on interval graph coloring and cookie profile similarity measures.
Evaluate our algorithms on real-world Y! data and show that they outperform various baseline methods.

2 COOKIE CLUSTERING

LEMMA 1. Suppose a browser has visit inter-arrival times distributed Exp(\lambda) and cookie-cleaning probability p at each visit. Suppose the first visit time and last visit time of the i-th cookie from this browser are denoted by f_i and l_i respectively, and WLOG let f_i = 0. Then,
1 The fraction of cookies that have lifetime d_i = 0 is p^2.
2 For the cookies with positive lifetime, the lifetime d_i is distributed as Exp(\lambda * p).
3 The intervals f_{i+k} - l_i, where i+k is the first cookie following i that has a positive lifetime, are distributed as Exp((1-p) * \lambda).



1 - Real Time Bid Optimization with Smooth Budget Delivery in Online Advertising

ABSTRACT

Restricted by the budget, the goal is to buy a set of ad impressions to reach as many targeted users as possible.
An online approach to the smooth budget delivery while optimizing for the conversion performance.

1 INTRODUCTION
RTB exchanges provide a technology for advertisers to algorithmically place a bid on any individual impression through a public auction. This functionality enables advertisers to buy inventory in a cost effective manner, and serve ads to the right person in the right context at the right time.

Demand-side platforms (DSPs) offer such a solution called real time bid optimization to help advertisers find the optimal bid value for each ad request.

The process of real time bid optimization tries to maximize the campaign performance goal under the delivery constraint within the budget schedule. 
Performance goal: minimizing cost-per-click (CPC) or cost-per-action (CPA); maximizing click-through rate (CTR) or conversion rate (CVR), under the constraint of smooth budget delivery.

2 BACKGROUND AND RELATED WORK

Two situations should not occur:

Premature Campaign Stop: Advertisers do not want their campaigns to run out of budget prematurely during the day, so as not to miss opportunities for the rest of the day.

Fluctuation in Spend: Advertisers would like to be able to analyze their campaigns regularly and high fluctuations in the budget make the consistency of the results questionable.

Two main issues:

Traffic Issue: Depending on the target audience, the volume of the online traffic varies a lot throughout the day.
Performance Issue: The quality of the online traffic changes over the course of the day for different groups of audience.

3 ONLINE BID OPTIMIZATION

Two different classes of campaigns: 

flat CPM campaign: CTR or CVR

dynamic CPM campaign: CPC or CPA

3.1 Smooth Delivery of Budget

S(t) is the dollar amount of money spent
reqs(t) is the number of incoming ad requests that satisfy the audience targeting constraints of the campaign
bids(t) is the number of ad requests that the campaign has bid on
imps(t) is the number of impressions of the campaign

pacing rate(t) = bids(t)/reqs(t)
win_rate(t) = imps(t)/bids(t)

3.2 Selection of High Quality Ad Requests - Flat CPM Campaigns

flat CPM campaign: CTR or CVR

win_rate(t) = imps(t)/bids(t)

goal is to simply select a set of ad requests to bid on considering the current time slot pacing rate
imps*(t) = s(t)/c*

For flat CPM campaigns, which always submit a fixed bid price c* to RTB exchanges, the goal is simply to select a set of ad requests to bid on, considering the current time slot's pacing rate.

In short, an ad request's CTR or CVR is first estimated by the statistical model. If the predicted value is larger than the upper bound of the threshold, the ad request is kept and the fixed bid price c* is submitted to the RTB exchange. If the predicted value is smaller than the lower bound of the threshold, the ad request is simply dropped without further processing.
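A minimal sketch of this flat CPM selection logic. The counts, thresholds, and the way borderline requests are paced are assumptions for illustration, not the paper's exact rule:

# Pacing and win rate for the current time slot t (toy counts)
reqs <- 4000      # ad requests matching the targeting constraints
bids <- 1000      # requests actually bid on
imps <- 400       # impressions won
pacing_rate <- bids / reqs   # pacing_rate(t) = bids(t) / reqs(t)
win_rate    <- imps / bids   # win_rate(t)    = imps(t) / bids(t)

# Flat CPM decision for one incoming request (thresholds are assumptions)
fixed_bid <- 2.5             # fixed bid price c*
upper <- 0.0030              # keep and bid if predicted CVR is above this
lower <- 0.0010              # drop if predicted CVR is below this

decide <- function(predicted_cvr) {
  if (predicted_cvr >= upper) {
    list(action = "bid", price = fixed_bid)
  } else if (predicted_cvr <= lower) {
    list(action = "drop", price = NA)
  } else {
    # Borderline quality: bid with probability equal to the pacing rate
    # (one plausible way to respect pacing; an assumption, not the paper's rule)
    if (runif(1) < pacing_rate) list(action = "bid", price = fixed_bid)
    else                        list(action = "skip", price = NA)
  }
}

decide(0.0042)   # high quality -> bid at c*
decide(0.0005)   # low quality  -> drop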

3.3 Selection of High Quality Ad Requests - Dynamic CPM Campaigns

dynamic CPM campaign: CPC or CPA

Dynamic CPM campaigns are free to change the bid price dynamically for each incoming ad request; the goal is to win enough high-quality impressions at lower cost.

Base bid price u_i = estimated CVR × CPA goal G

Thresholds beta1 and beta2:
Safe region: no under-delivery
Critical region: delivery is normal
Danger region: the campaign has a hard time finding and winning enough impressions.

For the case where our campaign is in the critical region, the submitted bid is c_i_hat = u_i.

For the case where our campaign is in the safe region, the submitted bid is c_i_hat = theta * u_i; from the paid price c_i we maintain a histogram of theta = c_i / c_i_hat.

For the case where our campaign is in the danger region, 
If the audience targeting constraints are too tight and hence there are not enough incoming ad requests selected for bidding, nothing can be done.
If the bid price is not high enough to win the public auction even if we bid very frequently, boost the bidding price.
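A sketch of the dynamic CPM bid adjustment across the three regions. The region test (a delivery ratio compared against beta1/beta2), the boost factor, and all numbers are illustrative assumptions; theta stands in for a draw from the paid/submitted price histogram described above:

cpa_goal <- 50        # advertiser's CPA goal G in dollars (illustrative)
beta1    <- 0.8       # delivery ratio above which the campaign is in the safe region (assumed)
beta2    <- 0.4       # delivery ratio below which the campaign is in the danger region (assumed)
theta    <- 0.7       # typical paid/submitted ratio theta = c_i / c_i_hat (assumed)
boost    <- 1.3       # bid boost used in the danger region (assumed)

dynamic_bid <- function(predicted_cvr, delivery_ratio) {
  u <- predicted_cvr * cpa_goal            # base bid price u_i = CVR * G
  if (delivery_ratio >= beta1) {
    u * theta                              # safe region: shade the bid downward
  } else if (delivery_ratio >= beta2) {
    u                                      # critical region: bid the base price
  } else {
    u * boost                              # danger region: boost to win more auctions
  }
}

dynamic_bid(predicted_cvr = 0.0002, delivery_ratio = 0.9)   # safe
dynamic_bid(predicted_cvr = 0.0002, delivery_ratio = 0.6)   # critical
dynamic_bid(predicted_cvr = 0.0002, delivery_ratio = 0.2)   # danger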

3.4 Estimation of CTR and CVR

Firstly, the estimated CVR provides a quality assessment for each ad request helping to decide on taking action on that particular ad request. 
Secondly, the base bid price is set to be the estimated CVR multiplied by the CPA goal, which directly affects the cost of advertising.

The hierarchy starts with the root and continues layer after layer by advertiser category, advertiser, insertion order, package, lineitem, ad and creative.

4 PRACTICAL ISSUES

4.1 Cold Start Problem

4.2 Prevention of Overspending

4.3 Distributed Architecture


5 EXPERIMENTAL RESULTS

5.1 Comparison of Pacing Strategies