Thursday, October 15, 2015

4 - Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale

1 INTRODUCTION
Changes over time:
Ad format/ ad content (campaigns, creative, sites, etc.),
Impression delivery rates,
Narrow / broad audience targeting,
Geo targeting changes, etc.

Traditionally, online campaign effectiveness has been measured by clicks, on the assumption that a user who interacts with an ad was affected by it.
Counting immediate responses (clicks) alone misses much of a campaign's value.
Counting all visitors who show long-term interest in the brand overstates campaign effectiveness.
A better metric is the change in online brand interest that can be attributed to the display ad campaign alone: user searches for brand terms or navigation to brand sites that can be attributed to the campaign.

Propensity scoring and doubly robust estimation methods are used to protect against residual and hidden selection bias. The test statistic for an outcome of interest to the advertiser is judged against a null distribution based on a set of outcomes that are irrelevant to the advertiser.

2 Causal effects
Counterfactuals are what would have happened had the campaign not been run, which is fundamental to understanding the analysis of observational data found in the databases.
Y0: user outcome if not shown a campaign ad (control group)
Y1: user outcome if shown a campaign ad (exposed/test group)
Selection bias: more active web users are more likely to be exposed, and less active web users are more likely to be unexposed and hence potential controls, even if they satisfy the targeting conditions for the campaign.
In short, the ad serving process leads to selection bias or differences in the control and exposed that are related to the distribution of the outcomes (Y0,Y1).

3 The Controls
The first ad served to an exposed user splits that user's timeline into before-exposure and after-exposure periods. Controls and exposed users are compared on their activity before the first exposure to the campaign.
Advertiser's own campaign information
Start date, end date, etc.
Ad serving logs
Ranking, frequency cap, etc.
Demographic & geographic info
Country, language
Historical activity before exposure
Number of navigations
Number of navigations to the advertiser's site before exposure
First exposure time
Number of campaign ads served to the user during the analysis period.

4 Propensity Scores
If a control and exposed user were identical before exposure, then it is reasonable to assume that they would have had the same probability of showing interest in the brand after exposure if the campaign had not been run.
P(Y0 = 1 | X, exposed) = P(Y0 = 1 | X, control)

4.1 Propensity Matching
P(X) = P(exposed | X)
Matching on the propensity P(X) removes selection bias whenever matching on X itself removes selection bias.
Matching on a consistent estimate Phat(X) can remove selection bias better than matching on X or P(X).
Controls and exposed with similar propensities can be grouped together and mean differences within groups averaged.
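
A minimal R sketch of this grouping, assuming y is a 0/1 brand-interest outcome, z an exposure indicator, and X a data frame of pre-exposure covariates (these names are illustrative, not from the paper):

df     <- data.frame(y = y, z = z, X)
phat   <- glm(z ~ . - y, data = df, family = binomial)$fitted.values   # propensity model
strata <- cut(phat, quantile(phat, seq(0, 1, 0.2)), include.lowest = TRUE)
# mean exposed-minus-control outcome difference within each propensity stratum
by_stratum <- tapply(seq_along(phat), strata, function(idx) {
  mean(df$y[idx][df$z[idx] == 1]) - mean(df$y[idx][df$z[idx] == 0])
})
delta_matched <- weighted.mean(by_stratum, w = table(strata), na.rm = TRUE)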

4.2 Inverse Propensity Weighting
There is selection bias because a feature is correlated with both sample selection and the outcome, but re-weighting the data with weights inversely proportional to the probability of selection removes the bias.
Control: Wi = Phat(Xi)/(1-Phat(Xi))
Exposed: Wi = 1
Weighting the exposed by 1/Phat(X) and the controls by 1/(1-Phat(X)) is appropriate when the goal is to estimate the potential campaign effect on all users.
Delta(IPW): the weighted difference in mean outcomes between exposed and controls using the weights above.
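
A minimal R sketch of the corresponding IPW estimate with the ATT-style weights above (y, z, and phat are illustrative names; phat is the fitted propensity):

ipw_effect <- function(y, z, phat) {
  w <- ifelse(z == 1, 1, phat / (1 - phat))   # exposed weight 1, control weight phat/(1-phat)
  weighted.mean(y[z == 1], w[z == 1]) - weighted.mean(y[z == 0], w[z == 0])
}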

5 Doubly Robust Estimation
The estimated Delta (DR) is doubly robust because it remains consistent if the outcome models are wrong but the propensity model is right or if the propensity model is wrong but the outcome models are right, although in those cases there is no guarantee that Delta(DR) has minimum asymptotic variance.
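
A rough R sketch of a doubly robust (AIPW) estimate, in its textbook average-treatment-effect form rather than necessarily the paper's exact estimator; df is assumed to hold a 0/1 outcome y, an exposure indicator z, and pre-exposure covariates:

dr_estimate <- function(df) {
  ps  <- glm(z ~ . - y, data = df, family = binomial)$fitted.values   # propensity model
  m1  <- glm(y ~ . - z, data = df[df$z == 1, ], family = binomial)    # outcome model, exposed
  m0  <- glm(y ~ . - z, data = df[df$z == 0, ], family = binomial)    # outcome model, control
  mu1 <- predict(m1, newdata = df, type = "response")
  mu0 <- predict(m0, newdata = df, type = "response")
  mean(df$z * (df$y - mu1) / ps + mu1) -
    mean((1 - df$z) * (df$y - mu0) / (1 - ps) + mu0)
}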

6 Hypothesis Testing
Under the null hypothesis, T(DR) = Delta(DR)/S(DR) approximately follows a standard Normal(0,1) distribution.
A fully non-parametric test is obtained by rejecting H0 only if T(DR) is larger than the corresponding quantile of the empirical null distribution built from outcomes irrelevant to the advertiser.
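
A small sketch of that test in R, assuming t_dr_target is the DR statistic for the outcome of interest and t_dr_irrelevant is a vector of DR statistics computed on the irrelevant outcomes (names are illustrative):

empirical_null_test <- function(t_dr_target, t_dr_irrelevant, alpha = 0.05) {
  cutoff <- quantile(t_dr_irrelevant, probs = 1 - alpha)   # empirical null quantile
  list(cutoff = cutoff, reject_h0 = t_dr_target > cutoff)
}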

7 Model Selection and Validation
The primary goal is to balance X across controls and exposed: the distribution of X conditional on the estimated propensity Phat(X) should be independent of Z, even for features of X that are not included in the fitted propensity model.
Balance can be checked with an ANOVA F-test or a GAM model (see the sketch below).
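
One way to run such a balance check, sketched below; this is an assumption on my part, not necessarily the paper's exact procedure: within propensity strata, test whether a covariate still differs between exposed and controls.

check_balance <- function(x, z, phat, n_strata = 5) {
  stratum <- cut(phat,
                 quantile(phat, seq(0, 1, length.out = n_strata + 1)),
                 include.lowest = TRUE)
  fit <- aov(x ~ stratum + factor(z))
  summary(fit)  # a significant z term signals residual imbalance in x
}
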
8 Experiments


9 Discussions

3 - LinkedIn Talk: Claudia Perlich: Data Science and Optimization

Problem setup:
1 Ad exchange system: real-time bidding
How much to bid? If we win, which ad to show?
2 Observe view-through conversion rate
Group content, reduce dimensionality.
Quality control for incoming data.
Attribution question: even assuming independent effects, who gets credit if multiple targeters show ads?
Goal:
Evaluate every opportunity (a browser showing up on inventory, i.e., the URL the user is currently on, where an ad could be shown), and, given a campaign, decide how many dollars to bid.
Procedure:
1 Depends on the deal type (CPM, CPA). Measure performance by conversion rate (CVR).
2 Data: 100M users across 100M URLs, 50K inventory buckets, 300 concurrent campaigns.

How to reduce dimensionality? 100M URLs with low base rates. Real-time scoring that must scale. How to sample?

0:15:10 - User churn rate is high: half of the people disappear within a day, which is similar to our results in the user churn analysis.

0:17:27 Three models:
Model 1: How to get signals from 100M URLs (dimensionality reduction)? About 20 URLs per user on average.
Unconditional brand process: what is Prob(conv)? Conversion probability independent of the ads.

Take every single positive event. Augment with URL visits to build the negative set, because you can't sample by user.
Naïve Bayes and sparse logistic regression to get scores.
Adding new, good data might cause problems; timing is important. Facts with 30 days of history and facts with 10 days of history contain different volumes and signals, so match them by sampling.

Model 2: Conditional on showing the user an ad, what is Prob(conv | impression)?
Frequency * scores as inputs, with about 20-dimensional variables.

Show ads to run-of-network (RON) users randomly. Prevent contamination from targeters (unclear: exclude retargeted user sets?).
Performance comparisons against RON.

Cluster URLs: pre-filter URLs and calculate correlations based on campaigns and scores. This works for small campaigns.
Model 3: Bidding.
First, signals that have nothing to do with ads reinforce the belief in the match.
Second, current intention.
Third, certain sites indicate the cookie will likely be deleted soon; ads shown to those users won't get credit. When the survival rate is low, there is less reason to show ads.
Fourth, some websites are invisible.
Fifth, perceptiveness: the invisible gorilla experiment shows that when people focus on a specific task, their perception of everything else shuts out completely. http://www.theinvisiblegorilla.com/gorilla_experiment.html

0:47:48 - The real-time bidding world uses a second-price rule (the highest bidder wins and pays the second-highest bid price). Theory suggests truthfully bidding your true value.
Experience: bid about $2, because you pay for volume and no true value exists. The account manager (AM) decides, adjusted by baseline probability.
$2 = $5 CPM * (1 - 60% margin).
Locally adjust that preference number by CVR; e.g., a good reason can justify bidding twice as much.
Strategy 1: price set by AM.
Strategy 2: the AM-set price * a ratio estimated by Pr(conv|segment, inventory)/Pr(conv|segment), which indicates the inventory value and separates good inventory from bad. An economically efficient model: pay as much as the impression is worth to you.
Strategy 3: greedy bidding. Bid only when the inventory looks good, but then bid a lot (see the sketch below).
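
A small R sketch of the three strategies (function and argument names, the threshold, and the greedy multiplier are illustrative, not from the talk):

bid_price <- function(strategy, am_price, p_conv_seg_inv, p_conv_seg,
                      good_threshold = 2, greedy_multiplier = 5) {
  ratio <- p_conv_seg_inv / p_conv_seg       # inventory-value ratio from the talk
  switch(strategy,
         am     = am_price,                  # Strategy 1: price set by the AM
         ratio  = am_price * ratio,          # Strategy 2: scale by inventory value
         greedy = if (ratio >= good_threshold)
                    am_price * greedy_multiplier else 0)  # Strategy 3: bid big or skip
}
# e.g. bid_price("ratio", am_price = 2, p_conv_seg_inv = 0.004, p_conv_seg = 0.002) returns 4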

Conclusion: both Strategy 2 and Strategy 3 achieved higher CVRs. Strategy 2 paid roughly the same price as Strategy 1: don't pay more, but perform better. Strategy 3 had a higher CPA.

Proxy outcomes: site visits, clicks, purchases. Site visits are almost always a better training signal than clicks, and in about 70% of cases site visits beat purchases, because of the bias-variance tradeoff.

Control trafficking by segment: Pr(conv|segment, inventory)/Pr(conv|segment).
0:55:19 - The worst place to go for inventory is social media, e.g., Facebook.

Lift about 3-5X over RON.
An ROC plot shows the sensitivity (% of correctly classified positive observations) and specificity (% of correctly classified negative observations) as the output threshold is moved over the range of all possible values. In the ROC context, the area under the curve (AUC) measures the performance of a classifier and is frequently used for method comparison.
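
A small base-R sketch of an ROC curve and trapezoidal AUC for a score vector and 0/1 labels (names are illustrative):

roc_auc <- function(score, label) {
  th  <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(th, function(t) mean(score[label == 1] >= t))  # sensitivity
  fpr <- sapply(th, function(t) mean(score[label == 0] >= t))  # 1 - specificity
  x <- c(0, fpr, 1); y <- c(0, tpr, 1)
  auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)        # trapezoidal rule
  list(fpr = fpr, tpr = tpr, auc = auc)
}
# e.g. roc_auc(score = runif(1000), label = rbinom(1000, 1, 0.3))$auc is about 0.5 for random scores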



2 - Overcoming Browser Cookie Churn with Clustering

WSDM 2012

Group similar cookies into clusters that are more persistent than individual cookies. 

Such clustering could potentially allow more robust estimation of the number of unique visitors of the site over a certain long time period, and also better user modeling which is key to plenty of web applications such as advertising and recommender systems.

Cluster browser cookies into groups that are likely to belong to the same browser based on a statistical model of browser visitation patterns.

A generalized interval graph coloring problem, with a greedy heuristic algorithm for solving it.

When a web browser visits a website for the first time, the website creates and assigns the browser a unique identifier (cookie), which is sent to the website on consecutive visits until it expires or is deleted by the user through the browser interface or otherwise.

1 INTRODUCTION 

Metrics:
1 the number of unique visitors 
2 the average engagement time of a user
3 the number of repeat visitors to the site

Cookie-based unique visitor counting systems, widely implemented by many websites, usually overestimate the number of unique visitors by large factors. We refer to the problem of frequent cookie deletion as "cookie churn".

Recently-created cookies have little history to learn from, aka, cold start problem, which negatively affects model performance. 

One approach estimates the cookie-to-visitor ratio and then infers the number of unique visitors to the site from the number of unique cookies and these ratios. A similar work finds the cookie-visitor or IP-visitor ratios and uses them to normalize the number of cookies.

Such ratio-based methods are applicable only to coarse-grained user counting, not to user behavioral modeling or reach-and-frequency problems.

We aim to construct similar browser fingerprints, but based on features that are much more benign and less discriminative.

Approach:
The approach for user modeling and reach-frequency estimation of ad campaigns is to create clusters of cookies based on behavioral profiles; each cluster ideally represents all and only the cookies associated with a single browser.
The lifetime of a cookie is defined as the time elapsed between cookie creation and its last page view.

Challenges:
First, the solution has to scale well to be able to process hundreds of millions of cookies created every month at major websites.
Second, since each cluster represents a browser, the number of clusters is unknown - in fact, one of the outputs of our method is an estimate of the number of underlying browsers. 
Third, for large websites with millions of simultaneous visitors, the number of cannot-link constraints is quadratic in the number of cookies, stressing scalability even more.
Fourth, the amount of useful information in cookie profiles is low, making it hard to extract meaningful profiles and to design an accurate similarity measure among them.

Solutions
Each cookie is an interval defined by its lifetime.
Partition the graph into the provably minimal number of groups of cookies such that the lifetimes of all cookies in a group do not overlap (see the sketch after this list).
Predict the likelihood that a cookie belongs to one of several candidate clusters, using cookie profile features such as lifetime, number of page views, OS type, distribution of IP addresses, and categories of visited webpages. Profile similarity measures include L2 distance and embedding Bayes factors learned through logistic regression.
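
A minimal sketch of the greedy interval partitioning step in R, assuming each cookie's lifetime is given by start and end times (illustrative names):

partition_cookies <- function(start, end) {
  ord       <- order(start)
  group     <- integer(length(start))
  group_end <- numeric(0)                    # last lifetime end per group
  for (i in ord) {
    free <- which(group_end < start[i])      # groups whose cookies all ended earlier
    g <- if (length(free) > 0) free[1] else length(group_end) + 1
    group[i]     <- g
    group_end[g] <- end[i]
  }
  group                                      # group label per cookie
}
# e.g. partition_cookies(start = c(0, 5, 12), end = c(10, 11, 20)) needs two groups,
# because the first two lifetimes overlap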

Contribution
Formalize the problem of dealing with cookie churn by creating clusters of cookies using a constrained clustering framework.
Define a simple browser model for cookie lifetimes and use real-world data to validate our assumptions.
Define a set of cookie profile features, and propose an efficient and scalable clustering algorithm based on interval graph coloring and cookie profile similarity measures.
Evaluate our algorithms on real-world Y! data and show that they outperform various baseline methods.

2 COOKIE CLUSTERING

LEMMA 1. Suppose a browser has the visit distribution Exp(\lambda) and the cookie-cleaning probability p at each visit. Suppose the first visit time and last visit time of the i-th cookie from this browser are denoted by f_i and l_i respectively, and WLOG let f_i = 0. Then,
1 The fraction of cookies that have lifetime d_i = 0 is p^2.
2 For the cookies with positive lifetime, the lifetime d_i is distributed as Exp(\lambda * p).
3 The intervals f_{i+k} - l_i, where i+k indexes the first cookie following i that has positive lifetime, are distributed as Exp((1-p) * \lambda).



1 - Real Time Bid Optimization with Smooth Budget Delivery in Online Advertising

ABSTRACT

Restricted by the budget, the advertiser's goal is to buy a set of ad impressions that reaches as many targeted users as possible.
The paper presents an online approach to smooth budget delivery while optimizing for conversion performance.

1 INTRODUCTION
RTB exchanges provide a technology for advertisers to algorithmically place a bid on any individual impression through a public auction. This functionality enables advertisers to buy inventory in a cost effective manner, and serve ads to the right person in the right context at the right time.

Demand-side platforms (DSPs) offer such a solution called real time bid optimization to help advertisers find the optimal bid value for each ad request.

The process of real time bid optimization tries to maximize the campaign performance goal under the delivery constraint within the budget schedule. 
Performance goal: minimizing cost-per-click (CPC) or cost-per-action (CPA); maximizing click-through rate (CTR) or conversion rate (CVR). Constraint: smooth budget delivery.

2 BACKGROUND AND RELATED WORK

Two situations advertisers want to avoid:

Premature Campaign Stop: Advertisers do not want their campaigns to run out of budget prematurely during the day and miss opportunities for the rest of the day.

Fluctuation in Spend: Advertisers would like to be able to analyze their campaigns regularly and high fluctuations in the budget make the consistency of the results questionable.

Two main issues:

Traffic Issue: Depending on the target audience, the volume of the online traffic varies a lot throughout the day.
Performance Issue: The quality of the online traffic changes over the course of the day for different groups of audience.

3 ONLINE BID OPTIMIZATION

Two different classes of campaigns: 

flat CPM campaign: CTR or CVR

dynamic CPM campaign: CPC or CPA

3.1 Smooth Delivery of Budget

S(t) is the dollar amount of money spent
reqs(t) is the number of incoming ad requests that satisfy the audience targeting constraints of the campaign
bids(t) is the number of ad requests that the campaign has bid on
imps(t) is the number of impressions of the campaign

pacing_rate(t) = bids(t)/reqs(t)
win_rate(t) = imps(t)/bids(t)

3.2 Selection of High Quality Ad Requests - Flat CPM Campaigns

flat CPM campaign: CTR or CVR

win_rate(t) = imps(t)/bids(t)

For flat CPM campaigns, which always submit a fixed bid price c* to RTB exchanges, the goal is simply to select a set of ad requests to bid on given the current time slot's pacing rate:
imps*(t) = S(t)/c*

In short, each ad request's CTR or CVR is first estimated by the statistical model. If the predicted value is larger than the upper threshold, the ad request is kept and the fixed bid price c* is submitted to the RTB exchange. If the predicted value is smaller than the lower threshold, the ad request is simply dropped without further processing.
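
A minimal sketch of that decision rule in R. This is not the paper's exact algorithm: the in-between case is resolved here by a coin flip at the current pacing rate, which is an assumption on my part.

decide_flat_cpm <- function(pred_cvr, lower, upper, pacing_rate, c_star) {
  if (pred_cvr >= upper) {
    list(bid = TRUE, price = c_star)         # high quality: bid the fixed price
  } else if (pred_cvr < lower) {
    list(bid = FALSE, price = NA)            # low quality: drop without further processing
  } else {
    take <- runif(1) < pacing_rate           # assumption: random selection at the pacing rate
    list(bid = take, price = if (take) c_star else NA)
  }
}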

3.3 Selection of High Quality Ad Requests - Dynamic CPM Campaigns

dynamic CPM campaign: CPC or CPA

Dynamic CPM campaigns are free to change the bid price for each incoming ad request; the goal is to win enough high-quality impressions at lower cost.

Base bid price u_i = estimated CVR * CPA goal G.

Thresholds beta1 and beta2 define three regions:
Safe region: no risk of under-delivery.
Critical region: delivery is normal.
Danger region: the campaign has a hard time finding and winning enough impressions.

For the case where our campaign is in the critical region, the submitted bid is ci_hat = u_i.

For the case where our campaign is in the safe region, the submitted bid is ci_hat = theta*u_i; from the submitted prices ci_hat and the prices ci actually paid, compute a histogram of theta = ci/ci_hat.

For the case where our campaign is in the danger region:
If the audience targeting constraints are too tight and there are not enough incoming ad requests selected for bidding, nothing can be done.
If the bid price is not high enough to win the public auction even though we bid very frequently, boost the bid price (see the sketch below).
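
A rough sketch of the dynamic-CPM bid by region; theta comes from the paid/submitted price histogram above, and the danger-region boost factor is an illustrative assumption, not from the paper:

dynamic_bid <- function(pred_cvr, cpa_goal, region, theta = 1, boost = 1.2) {
  u <- pred_cvr * cpa_goal            # base bid price u_i = CVR * G
  switch(region,
         critical = u,                # bid the base price
         safe     = theta * u,        # discount toward the expected clearing price
         danger   = boost * u)        # assumption: boost the bid to win more auctions
}
# e.g. dynamic_bid(pred_cvr = 0.004, cpa_goal = 50, region = "safe", theta = 0.7)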

3.4 Estimation of CTR and CVR

Firstly, the estimated CVR provides a quality assessment for each ad request, helping decide whether to take action on that particular ad request.
Secondly, the base bid price is set to be the estimated CVR multiplied by the CPA goal, which directly affects the cost of advertising.

The hierarchy starts with the root and continues layer after layer by advertiser category, advertiser, insertion order, package, lineitem, ad and creative.

4 PRACTICAL ISSUES

4.1 Cold Start Problem

4.2 Prevention of Overspending

4.3 Distributed Architecture


5 EXPERIMENTAL RESULTS

5.1 Comparison of Pacing Strategies

Sunday, September 20, 2015

Power Iteration for Page Rank in R

Use Case of Page Rank Algorithm

Today, we get such a ranking thanks to the work of Jose Lages at the University of Franche-Comte in France and a few pals. They’ve used the way universities are mentioned on Wikipedia to produce a world ranking. Their results provide a new way to think about rankings that may help to avoid some of the biases that can occur in other ranking systems.

Biases can crop up remarkably easily. For example, in the last century, English has become the de facto language of science, and the advantage this gives English-speaking countries is hard to quantify.

And there are other factors that are unique to university rankings. Some institutions focus more on teaching than on research—how should these factors be balanced?

The new work attempts to get around some of these problems using the Pagerank algorithm that Google famously uses to rank websites in search results. This uses the network of links between nodes on a network to determine those that are the most important.

The key insight is that the algorithm counts a node as important if other important nodes point to it. So it repeatedly works through the links, recalculating the importance of every node on each iteration, to come up with a ranking.

Exactly this process can be applied to Wikipedia articles. Each university mentioned in an article is a node in the network, and the links pointing toward it are used to determine a ranking (see also “Artificial Intelligence Aims to Make Wikipedia Friendlier and Better”).

Lages and co apply this process to 24 different language editions of Wikipedia. This database contains some four million articles in English, 1.5 million in German and around a million in each of French, Dutch, Italian, Spanish, and Russian. It also includes Chinese, Hebrew, Hungarian, and so on. “These 24 languages cover 59% of world population and 68% of the total number of Wikipedia articles in all 287 languages,” they say.

The team first determines a ranking for each language and points out that each language edition tends to favor its own universities. So the top 100 list in French includes 32 French-speaking universities, the top 100 in German includes 63 German-speaking universities, and so on. They then combine the lists to produce a global ranking.

Mathematics of Page Rank Algorithm

http://www.ams.org/samplings/feature-column/fcarc-pagerank


# simple power iteration: a converges to the PageRank vector
# A is the column-stochastic link matrix (column j gives the out-link
# probabilities of page j); byrow=T so the rows below read as written.
a <- c(1/3,1/3,1/3);
A <- matrix(c(1/2,1/2,0,
              1/2,0,1,
              0,1/2,0), nrow=3, byrow=T);
for (i in 1:100){
  a=A%*%a;
  print(a)
}

# teleport: damping factor 0.8 on the link matrix plus 0.2 uniform teleport
a <- c(1/3,1/3,1/3);
A <- matrix(c(1/2,1/2,0,
              1/2,0,0,
              0,1/2,1), nrow=3, byrow=T);   # page 3 links only to itself (spider trap)
B <- matrix(rep(1/3,9), nrow=3);
C <- .8*A+.2*B;
for (i in 1:100){
  a=C%*%a;
  print(a)
}

# ex1: teleport with damping factor 0.7 (page 3 is a spider trap that absorbs rank)
a <- c(1/3,1/3,1/3);
A <- matrix(c(0,0,0,1/2,0,0,1/2,1,1), nrow=3, byrow=T);
B <- matrix(rep(1/3,9), nrow=3);
C <- .7*A+.3*B;
for (i in 1:100){
  a=C%*%a;
  print(a)
}


# power iteration without teleport; the start vector is unnormalized (sums to 3)
a <- c(1,1,1);
A <- matrix(c(0,0,1,1/2,0,0,1/2,1,0), nrow=3, byrow=T);
for (i in 1:100){
  a=A%*%a;
  print(a)
}

Sunday, September 6, 2015

Estimating the exponent from empirical data in R

pwrdist <- function(u, ...) {
  # u is a vector of event counts, e.g. how many
  # crimes a given perpetrator was charged with by the police
  fx <- table(u)                      # frequency of each observed count value
  i <- as.numeric(names(fx))
  y <- rep(0, max(i))
  y[i] <- fx
  # fit log(E[y]) = a + s*log(x) with a quasi-Poisson GLM; s estimates the exponent
  m0 <- glm(y ~ log(1:max(i)), family = quasipoisson())
  print(summary(m0))
  sub <- paste("s=", round(m0$coef[2], 2), "lambda=", sum(u), "/", length(u))
  plot(i, fx, log = "xy", xlab = "x", sub = sub, ylab = "counts", ...)
  grid()
  lines(1:max(i), fitted(m0), type = "l")
  return(m0)
}
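
A hypothetical usage example (the simulation parameters are my own, not from the original snippet):

# Simulate roughly power-law-distributed counts (continuous Pareto with
# exponent ~2.5, floored to integers) and fit:
set.seed(42)
u <- floor(runif(5000)^(-1/1.5))
fit <- pwrdist(u)
# fit$coef[2] should land roughly near -2.5, though discretization
# and binning introduce some bias.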


Thursday, March 5, 2015

OLS Model in R

Content Weekly Trend
========================================================

### Read-in content raw data - weekly aggregated data

```{r echo=FALSE, results='hide',message=FALSE}
memory.limit()
memory.size(max = TRUE)
rm(list=ls(all=T))

require(data.table)
require(ggplot2)
require(plyr)

data <- read.table("content_data.csv", header = FALSE, sep = ',');
setnames(data, c("pv","mon","wk", "week"));
head(data)
```

```{r echo=FALSE, results='hide',message=FALSE, fig.width=20, fig.height=8}
ggplot(data, aes(x=wk, y=pv)) +
  geom_bar(stat = "identity", fill = "blue")+
  geom_hline(yintercept = mean(data$pv), color = 'red', size = 1, lty = 3) +
  ggtitle("Weekly Pageview Counts") +
  ylab("Page Views") + xlab("Week") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 16, angle=-90)) +
  theme(axis.text.y = element_text(face = "bold", size = 16)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle = 90)) +
  theme(legend.position = "top") +
  theme(legend.key = element_rect(colour = NA)) +
  theme(legend.title = element_blank()) +
  theme(legend.position = "none")

```

Drop the first 11 months of pageviews.
```{r}
data <- data[14:77,]
head(data)
```

### OLS

#### Stata code
```
import excel "mytest.xlsx", sheet("Sheet1") firstrow
tsset mycount
generate mylog = log(wvolume)
generate cc = mycount*mycount
regress mylog mycount i.mymonth
```

### Regress on continuous month
```{r}
data$mylog = log(data$pv)
data$mycount = (1:length(data$pv))
data$cc = data$mycount^2
ols1 <- lm(mylog ~ mycount + mon, data = data)
summary(ols1)
exp(ols1$coefficients[3])
AIC(ols1)
BIC(ols1)
```

### Regress on discrete month
```{r}
data$mon = as.factor(data$mon)
ols2 <- lm(mylog ~ mycount + cc + mon, data = data)
summary(ols2)
AIC(ols2)
BIC(ols2)
par(mfrow=c(2, 2))
plot(ols2)

anova(ols1, ols2)

```

### trending factor
```{r}
exp(ols2$coefficients[2]+ols2$coefficients[3])
```

### monthly adjustment factor
```{r}
exp(ols2$coefficients[4:14])
```

### Aggregated weekly data to generate monthly data
```{r}
data2 <- data.table(data)
data3<-data2[,list(pv=sum(pv),mon=min(as.integer(mon))), by=week][order(week)]
```


```{r echo=FALSE, results='hide',message=FALSE, fig.width=20, fig.height=8}
ggplot(data3, aes(x=week, y=pv)) +
  geom_bar(stat = "identity", fill = "darkblue")+
  geom_hline(yintercept = mean(data3$pv), color = 'red', size = 1, lty = 3) +
  ggtitle("Monthly Pageview Counts") +
  ylab("PageViews") + xlab("Month") +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  theme(axis.text.x = element_text(face = "bold", size = 16, angle=-90)) +
  theme(axis.text.y = element_text(face = "bold", size = 16)) +
  theme(axis.title.x = element_text(face = "bold", size = 16)) +
  theme(axis.title.y = element_text(face = "bold", size = 16, angle = 90)) +
  theme(legend.position = "top") +
  theme(legend.key = element_rect(colour = NA)) +
  theme(legend.title = element_blank()) +
  theme(legend.position = "none")
```

```{r}
data3$mylog = log(data3$pv)
data3$mycount = (1:length(data3$pv))
data3$cc = data3$mycount^2
data3$mon = as.factor(data3$mon)

ols3 <- lm(mylog ~ mycount + cc + mon, data = data3)
summary(ols3)
AIC(ols3)
BIC(ols3)
par(mfrow=c(2, 2))
plot(ols3)
```

### trending factor
```{r}
exp(ols3$coefficients[2]+ols3$coefficients[3])
```

### monthly adjustment factor
```{r}
exp(ols3$coefficients[4:14])
```