Thursday, October 15, 2015

2 - Overcoming Browser Cookie Churn with Clustering

WSDM 2012

Group similar cookies into clusters that are more persistent than individual cookies. 

Such clustering could potentially allow more robust estimation of the number of unique visitors of the site over a certain long time period, and also better user modeling which is key to plenty of web applications such as advertising and recommender systems.

Cluster browser cookies into groups that are likely to belong to the same browser based on a statistical model of browser visitation patterns.

Generalized interval graph coloring problem and a greedy heuristic algorithm for solving it.

When a web browser visits a website for the first time, the website creates and assigns the browser a unique identifier (cookie), which is sent to the website on consecutive visits until it expires or is deleted by user through the browser interface or otherwise.

1 INTRODUCTION 

Metrics:
1 the number of unique visitors 
2 the average engagement time of a user
3 the number of repeat visitors to the site

Cookie-based unique visitor counting system are widely implemented by many websites, usually overestimate the number of unique visitors by large factors. We refer to problem of frequent cookie deletion as “cookie churn”.

Recently-created cookies have little history to learn from, aka, cold start problem, which negatively affects model performance. 

Estimate cookie visitor ratio, and then infer the number of unique visitors to site based on the number of unique cookies and these ratios. A similar work tries to find the cookie-visitor or IP-visitor ratios and uses these to normalize the number of cookies.

Such ratio-based methods are applicable only to coarse grained user counting and not to the user behavioral modeling or the reach and frequency problems.

We aim to construct similar browser fingerprints, but based on features that are much more benign and less discriminative.

Approach:
User-modeling and reach-frequency estimation from ad campaign is to create clusters of cookies based on a behavioral profile - each cluster ideally will represent all and only the cookies associated with a single browser.
The lifetime of cookies is defined as time elapsed between cookie creation and the last page view.

Challenges:
First, the solution has to scale well to be able to process hundreds of millions of cookies created every month at major websites.
Second, since each cluster represents a browser, the number of clusters is unknown - in fact, one of the outputs of our method is an estimate of the number of underlying browsers. 
Third, the large websites that have millions of simultaneous visitors, the number of cannot-link constraints is quadratic in the number of cookies, stressing scalability even more.
Fourth, the amount of useful information in their profiles is low, making it hard to extract meaningful cookie profiles and to design accurate similarity measure among these profiles.

Solutions
Each cookie is an interval defined by its lifetime.
Partition the graph into the provably minimal number of groups of cookies where the lifetimes of all cookies in a group do not overlap.
Predict the likelihood of a cookie to belong to one of the several candidate clusters. Cookie profile features such as cookie lifetime, number of page views, OS type, distribution of IP addresses, categories of visited webpages, etc.  Profile similarity measures include L2 distance and Embedding Bayes factors which learned through logistic regression.

Contribution
Formalize the problem of dealing with cookie churn by creating clusters of cookies using a constrained clustering framework.
Define a simple browser model for cookie lifetimes and also uses real-world data to validate our assumption. 
Define a set of cookie profile features and propose and efficient and scalable clustering algorithm based on interval graph coloring and cookie profile similarity measures.
Evaluate our algorithms on the real-world Y! data and show that it outperforms various baseline methods.

2 COOKIE CLUSTERING

LEMMA1 Suppose a browser has the visit-distribution Exp(\lamda) and the cookie-cleaning probability p for each visit. Suppose the first visit time and last visit time of the i-th cookie from this browser are denoted by f_i and l_i respectively and WLOG let f_i=0. Then,
1 The fraction of cookies that have lifetime d_i =0 is p^2.
2 For the cookies with positive lifetime, the lifetime d_i is distributed as Exp(\lambda.p)
3 The intervals f_(i+k) - l_i is the first cookie following i that has positive lifetime, are distributed as Exp((1-p)*\lambda).



No comments:

Post a Comment