Tuesday, March 15, 2016
Bandit algorithms 7 - Bandits in the Real World: Complexity and Complications

-- A/A Testing
You should use the code you develop for assigning users to arms, but then define two different arms that are actually the same user experience. If you find differences between these two identical arms, you need to substantially temper your claims about differences between the other arms. An illusory difference between two A arms may indicate that there's a bug in your code or a mistake in the way you're analyzing your data. But it may also indicate that your test is running in a context that is subtly different from the assumptions we've been implicitly making when setting up our algorithms and simulations.

-- Running Concurrent Experiments
In an ideal world, the concurrency issues raised by running multiple experiments at once won't come up. The best solution is simple: try your best to keep track of all of the experiments each user is a part of and include this information in your analyses of any single experiment.

-- Continuous Experimentation vs. Periodic Testing
Bandit algorithms look much better than A/B testing when you are willing to let them run for a very long time.

-- Bad Metrics of Success
Once you decide to start working with automated metrics, you need to supplement those systems by exercising human judgment and keeping an eye on what happens as the system makes changes to your site.

-- Scaling Problems with Good Metrics of Success
The algorithms described in this book may not work well unless you rescale your metrics into the 0-1 space we've used in our examples. The reasons for this are quite boring: some of the algorithms are numerically unstable, especially the Softmax algorithm, which will break down if you ask it to exponentiate large values.

-- Intelligent Initialization of Values
First, you can use the historical metrics for the control arm to initialize your bandit algorithm. Second, you can use the amount of historical data you have to calibrate how much the algorithm thinks you know about the historical options.

-- Running Better Simulations
In addition to initializing your algorithm using the prior information you have before deploying it, you can often run much better simulations if you use historical information to build appropriate simulations.

-- Moving Worlds
Arms with changing values can be a very serious problem if you're not careful when you deploy a bandit algorithm. The standard update rule used throughout these posts,

new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
self.values[chosen_arm] = new_value

weights every past observation equally, which makes the estimates slow to respond when an arm's true value drifts (see the sketch at the end of this post for one way to adapt it).

-- Correlated Bandits
Look into classical smoothing techniques in statistics to get a sense for how you might deal with correlated arms.

-- Contextual Bandits
See "A Contextual-Bandit Approach to Personalized News Article Recommendation" and "Parametric Bandits: The Generalized Linear Case".

-- Implementing Bandit Algorithms at Scale
Blocked assignments: Assign incoming users to new arms in advance and draw this information from a fast cache when users actually arrive. Store their responses for batch processing later in another fast cache.
Blocked updates: Update your estimates of arm values in batches on a regular interval and regenerate your blocked assignments.

Here are the most salient lessons:

Trade-offs, trade-offs, trade-offs
In the real world, you always have to trade off between gathering data and acting on that data. Pure experimentation in the form of exploration is always a short-term loss, but pure profit-making in the form of exploitation is always blind to the long-term benefits of curiosity and open-mindedness. You can be clever about the compromises you make, but you will have to make some compromises.

God does play dice
Randomization is the key to the good life. Controlled experiments online won't work without randomization. If you want to learn from your experiences, you need to be in complete control of those experiences. While the UCB algorithms we've used in this book aren't truly randomized, they behave at least partially like randomized algorithms from the perspective of your users. Ultimately what matters most is that you make sure that end users can't self-select into the arms you want to experiment with.

Defaults matter a lot
The way in which you initialize an algorithm can have a powerful effect on its long-term success. No matter what you do, you will be biased in some way or another; what matters is that you spend some time learning whether your biases help you or hurt you. Part of the genius of the UCB family of algorithms is that they do this initialization in a very systematic way right at the start.

Take a chance
You should try everything at the start of your explorations to ensure that you know a little bit about the potential value of every option. Don't close your mind without giving something a fair shot. At the same time, just one experience should be enough to convince you that some mistakes aren't worth repeating.

Everybody's gotta grow up sometime
You should make sure that you explore less over time. No matter what you're doing, it's important that you don't spend your whole life trying out every crazy idea that comes into your head. In the bandit algorithms we've tried, we've seen this lesson play out when we've implemented annealing. The UCB algorithms achieve similar effects by explicitly counting their experiences with different arms. Either strategy is better than not taking any steps to become more conservative over time.

Leave your mistakes behind
You should direct your exploration to focus on the second-best option, the third-best option, and a few other options that are just a little bit further away from the best. Don't waste much or any of your time on options that are clearly losing bets. Naive experimentation of the sort that occurs in A/B testing is often a deadweight loss if some of the ideas you're experimenting with are disasters waiting to happen.

Don't be cocky
You should keep track of how confident you are about your evaluations of each of the options available to you. Don't be close-minded when you don't have evidence to support your beliefs. At the same time, don't be so unsure of yourself that you forget how much you already know. Measuring one's confidence explicitly is what makes UCB so much more effective than either the epsilon-Greedy algorithm or the Softmax algorithm in some settings.

Context matters
You should use any and every piece of information you have available to you about the context of your experiments. Don't simplify the world too much and pretend you've got things figured out: there's more to optimizing your business than comparing A with B. If you can figure out a way to exploit context using strategies like those seen in the contextual bandit algorithms we briefly discussed, use them. And if there are ways to generalize your experiences across arms, take advantage of them.
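As promised in the Moving Worlds note above, one standard way to handle nonstationary arms is to replace the 1/n weight in the update rule with a fixed step size alpha, so that older observations are gradually forgotten. This is a minimal sketch of our own, not the book's prescription; the alpha value is purely illustrative.

# Sketch of an update rule for "moving world" arms: weight recent rewards
# by a fixed step size alpha instead of 1/n, so the estimate can track
# changes in an arm's true value. alpha = 0.1 is illustrative.
def update(self, chosen_arm, reward):
  alpha = 0.1
  self.counts[chosen_arm] = self.counts[chosen_arm] + 1
  value = self.values[chosen_arm]
  new_value = (1 - alpha) * value + alpha * reward
  self.values[chosen_arm] = new_value
  return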
Monday, March 14, 2016
Bandit algorithms 6 - UCB - The Upper Confidence Bound Algorithm
-- Introducing the UCB Algorithm
Both the epsilon-Greedy algorithm and the Softmax algorithm share the following broad properties:
The algorithm's default choice is to select the arm that currently has the highest estimated value.
The algorithm sometimes decides to explore and chooses an option that isn't the one that currently seems best:
The epsilon-Greedy algorithm explores by selecting from all of the arms completely at random. It makes one of these random exploratory decisions with probability epsilon.
The Softmax algorithm explores by randomly selecting from all of the available arms with probabilities that are roughly proportional to the estimated value of each arm.
Both algorithms can be made to explore less over time, and thereby perform better, by annealing: dynamically shrinking their exploration parameters as they gain experience.
The epsilon-Greedy and Softmax algorithms are gullible. They are easily misled by a few negative experiences. Because of their use of randomness, they can make up for this later.
UCB doesn’t use randomness at all. Unlike epsilon-Greedy or Softmax, it’s possible to know exactly how UCB will behave in any given situation. This can make it easier to reason about at times.
UCB doesn’t have any free parameters that you need to configure before you can deploy it. This is a major improvement if you’re interested in running it in the wild, because it means you can start using UCB without having a clear sense of how you expect the world to behave.
-- Implementing UCB
import math

# ind_max(x) (the index of the largest element of x) is defined in the
# epsilon-Greedy post below and lives in core.py.

class UCB1():
  def __init__(self, counts, values):
    self.counts = counts  # number of times each arm has been pulled
    self.values = values  # running average reward for each arm
    return

  def initialize(self, n_arms):
    self.counts = [0 for col in range(n_arms)]
    self.values = [0.0 for col in range(n_arms)]
    return

  def select_arm(self):
    n_arms = len(self.counts)
    # Play each arm once before applying the confidence-bound rule.
    for arm in range(n_arms):
      if self.counts[arm] == 0:
        return arm
    ucb_values = [0.0 for arm in range(n_arms)]
    total_counts = sum(self.counts)
    for arm in range(n_arms):
      # The curiosity bonus shrinks as an arm is played more often.
      bonus = math.sqrt((2 * math.log(total_counts)) / float(self.counts[arm]))
      ucb_values[arm] = self.values[arm] + bonus
    return ind_max(ucb_values)

  def update(self, chosen_arm, reward):
    self.counts[chosen_arm] = self.counts[chosen_arm] + 1
    n = self.counts[chosen_arm]
    value = self.values[chosen_arm]
    # Incrementally update the running average reward for this arm.
    new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
    self.values[chosen_arm] = new_value
    return
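UCB1 can be exercised with the same Monte Carlo harness used for the other algorithms in these posts. This is a minimal sketch, assuming core.py provides BernoulliArm, ind_max, and test_algorithm as in the posts below; the output path is illustrative.

execfile("core.py")
import random

random.seed(1)
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)
random.shuffle(means)
arms = map(lambda (mu): BernoulliArm(mu), means)
print("Best arm is " + str(ind_max(means)))

# UCB1 has no free parameters, so there is no loop over parameter settings.
algo = UCB1([], [])
algo.initialize(n_arms)
results = test_algorithm(algo, arms, 5000, 250)

f = open("algorithms/ucb/ucb1_results.tsv", "w")
for i in range(len(results[0])):
  f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()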
-- Comparing Bandit Algorithms Side-by-Side
We can very clearly see how much noisier UCB1’s behavior looks than the epsilon-Greedy or Softmax algorithm’s.
We see that the epsilon-Greedy algorithm doesn’t converge as quickly as the Softmax algorithm. This might suggest that we need to use another annealing schedule or that this testbed is one in which the Softmax algorithm is simply superior to the epsilon-Greedy algorithm.
We see that UCB1 takes a while to catch up with the annealing Softmax algorithm, but that it does start to catch up right near the end of the plays we’ve simulated. In the exercises we encourage you to try other environments in which UCB1 might outperform Softmax unambiguously.
UCB1 finds the best arm very quickly, but the backpedaling it does causes it to underperform the Softmax algorithm along most metrics.
Bandit algorithms 5 - The Softmax Algorithm
-- Introducing the Softmax Algorithm
The Softmax algorithm tries to cope with arms differing in estimated value by explicitly incorporating information about the reward rates of the available arms into its method for choosing which arm to select when it explores.
import random

def categorical_draw(probs):
  # Draw an index i with probability probs[i].
  z = random.random()
  cum_prob = 0.0
  for i in range(len(probs)):
    prob = probs[i]
    cum_prob += prob
    if cum_prob > z:
      return i
  return len(probs) - 1

# A first, simplified select_arm: choose each arm with probability
# proportional to its estimated value (no temperature yet).
def select_arm(self):
  z = sum(self.values)
  probs = [v / z for v in self.values]
  return categorical_draw(probs)
The temperature parameter is named by analogy with physics, in which systems at high temperatures tend to behave randomly, while they take on more structure at low temperatures. In fact, the full Softmax algorithm is closely related to the Boltzmann distribution, which is used in physics to describe how groups of particles behave.
We’ll call this new temperature parameter tau. We introduce tau to produce the following new algorithm:
At time T, select one of the two arms with probabilities computed as follows:
exp(rA / tau) / (exp(rA / tau) + exp(rB / tau))
exp(rB / tau) / (exp(rA / tau) + exp(rB / tau))
For whichever arm you picked, update your estimate of the mean using the same update rule we used for the epsilon-Greedy algorithm.
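To make the formulas concrete, here is a small worked example with hypothetical estimates rA and rB and a low temperature; the numbers are purely illustrative.

import math

# Hypothetical estimated values for two arms and a temperature of 0.1.
rA, rB, tau = 0.2, 0.8, 0.1

prob_A = math.exp(rA / tau) / (math.exp(rA / tau) + math.exp(rB / tau))
prob_B = math.exp(rB / tau) / (math.exp(rA / tau) + math.exp(rB / tau))

print(prob_A)  # roughly 0.0025: the weaker arm is chosen rarely
print(prob_B)  # roughly 0.9975: the stronger arm dominates at low temperature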
-- Implementing the Softmax Algorithm
import math
import random

def categorical_draw(probs):
  z = random.random()
  cum_prob = 0.0
  for i in range(len(probs)):
    prob = probs[i]
    cum_prob += prob
    if cum_prob > z:
      return i
  return len(probs) - 1

class Softmax:
  def __init__(self, temperature, counts, values):
    self.temperature = temperature
    self.counts = counts
    self.values = values
    return

  def initialize(self, n_arms):
    self.counts = [0 for col in range(n_arms)]
    self.values = [0.0 for col in range(n_arms)]
    return

  def select_arm(self):
    z = sum([math.exp(v / self.temperature) for v in self.values])
    probs = [math.exp(v / self.temperature) / z for v in self.values]
    return categorical_draw(probs)

  def update(self, chosen_arm, reward):
    self.counts[chosen_arm] = self.counts[chosen_arm] + 1
    n = self.counts[chosen_arm]
    value = self.values[chosen_arm]
    new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
    self.values[chosen_arm] = new_value
    return
-- Measuring the Performance of the Softmax Algorithm
execfile("core.py")
import random

random.seed(1)
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)
random.shuffle(means)
arms = map(lambda (mu): BernoulliArm(mu), means)
print("Best arm is " + str(ind_max(means)))

f = open("algorithms/softmax/standard_softmax_results.tsv", "w")
for temperature in [0.1, 0.2, 0.3, 0.4, 0.5]:
  algo = Softmax(temperature, [], [])
  algo.initialize(n_arms)
  results = test_algorithm(algo, arms, 5000, 250)
  for i in range(len(results[0])):
    f.write(str(temperature) + "\t")
    f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()
-- The Annealing Softmax Algorithm
Annealing is the process of decreasing the temperature in a Softmax algorithm over time.
Annealing is a process of modifying a bandit algorithm’s behavior so that it will explore less over time.
t = sum(self.counts) + 1
temperature = 1 / math.log(t + 0.0000001)
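To get a feel for this schedule, it helps to print the temperature at a few values of t and watch it shrink, which is what makes the algorithm explore less over time. This is a small illustration of our own; the values in the comments are rounded.

import math

for t in [2, 10, 100, 1000, 10000]:
  temperature = 1 / math.log(t + 0.0000001)
  print(t, round(temperature, 3))
# t = 2     -> about 1.443
# t = 10    -> about 0.434
# t = 100   -> about 0.217
# t = 1000  -> about 0.145
# t = 10000 -> about 0.109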
import math
import random

def categorical_draw(probs):
  z = random.random()
  cum_prob = 0.0
  for i in range(len(probs)):
    prob = probs[i]
    cum_prob += prob
    if cum_prob > z:
      return i
  return len(probs) - 1

class AnnealingSoftmax:
  def __init__(self, counts, values):
    self.counts = counts
    self.values = values
    return

  def initialize(self, n_arms):
    self.counts = [0 for col in range(n_arms)]
    self.values = [0.0 for col in range(n_arms)]
    return

  def select_arm(self):
    t = sum(self.counts) + 1
    temperature = 1 / math.log(t + 0.0000001)
    z = sum([math.exp(v / temperature) for v in self.values])
    probs = [math.exp(v / temperature) / z for v in self.values]
    return categorical_draw(probs)

  def update(self, chosen_arm, reward):
    self.counts[chosen_arm] = self.counts[chosen_arm] + 1
    n = self.counts[chosen_arm]
    value = self.values[chosen_arm]
    new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
    self.values[chosen_arm] = new_value
    return
Friday, March 11, 2016
Bandit algorithms 4 - Debugging Bandit Algorithms
-- Monte Carlo Simulations Are Like Unit Tests for Bandit Algorithms
Active learning refers to algorithms that actively select which data they should receive.
Online learning refers to algorithms that analyze data in real time and provide results on the fly.
-- Simulating the Arms of a Bandit Problem
Optimizing click-through rates for ads: Every time we show someone an ad, we’ll imagine that there’s a fixed probability that they’ll click on the ad. The bandit algorithm will then estimate this probability and try to decide on a strategy for showing ads that maximizes the click-through rate.
Conversion rates for new users: Every time a new visitor comes to our site who isn’t already a registered user, we’ll imagine that there’s a fixed probability that they’ll register as a user after seeing the landing page. We’ll then estimate this probability and try to decide on a strategy for maximizing our conversion rate.
import random

class BernoulliArm():
  def __init__(self, p):
    self.p = p  # probability that this arm pays out a reward of 1

  def draw(self):
    # Return 1.0 with probability p, otherwise 0.0.
    if random.random() > self.p:
      return 0.0
    else:
      return 1.0
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)
random.shuffle(means)
arms = map(lambda (mu): BernoulliArm(mu), means)
>>> arms[0].draw()
1.0
>>> arms[1].draw()
0.0
>>> arms[2].draw()
0.0
>>> arms[2].draw()
0.0
>>> arms[3].draw()
0.0
>>> arms[2].draw()
0.0
>>> arms[4].draw()
0.0
def test_algorithm(algo, arms, num_sims, horizon):
  # Flat arrays with one slot per (simulation, time step) pair.
  chosen_arms = [0.0 for i in range(num_sims * horizon)]
  rewards = [0.0 for i in range(num_sims * horizon)]
  cumulative_rewards = [0.0 for i in range(num_sims * horizon)]
  sim_nums = [0.0 for i in range(num_sims * horizon)]
  times = [0.0 for i in range(num_sims * horizon)]
  for sim in range(num_sims):
    sim = sim + 1
    algo.initialize(len(arms))
    for t in range(horizon):
      t = t + 1
      index = (sim - 1) * horizon + t - 1
      sim_nums[index] = sim
      times[index] = t
      chosen_arm = algo.select_arm()
      chosen_arms[index] = chosen_arm
      reward = arms[chosen_arms[index]].draw()
      rewards[index] = reward
      if t == 1:
        cumulative_rewards[index] = reward
      else:
        cumulative_rewards[index] = cumulative_rewards[index - 1] + reward
      algo.update(chosen_arm, reward)
  return [sim_nums, times, chosen_arms, rewards, cumulative_rewards]
execfile("core.py")
import random

random.seed(1)
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)
random.shuffle(means)
arms = map(lambda (mu): BernoulliArm(mu), means)
print("Best arm is " + str(ind_max(means)))

f = open("algorithms/epsilon_greedy/standard_results.tsv", "w")
for epsilon in [0.1, 0.2, 0.3, 0.4, 0.5]:
  algo = EpsilonGreedy(epsilon, [], [])
  algo.initialize(n_arms)
  results = test_algorithm(algo, arms, 5000, 250)
  for i in range(len(results[0])):
    f.write(str(epsilon) + "\t")
    f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()
-- Analyzing Results from a Monte Carlo Study
Approach 1: Track the Probability of Choosing the Best Arm
Approach 2: Track the Average Reward at Each Point in Time
Approach 3: Track the Cumulative Reward at Each Point in Time
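As an illustration of Approach 1, here is a sketch (our own helper, not from the book) that computes the probability of choosing the best arm at each time step directly from the list returned by test_algorithm; best_arm is whatever index ind_max(means) reports in the test scripts.

def prob_best_arm(results, best_arm, horizon):
  # results = [sim_nums, times, chosen_arms, rewards, cumulative_rewards]
  times = results[1]
  chosen_arms = results[2]
  counts = [0 for t in range(horizon)]
  hits = [0 for t in range(horizon)]
  for i in range(len(times)):
    t = int(times[i]) - 1  # times run from 1 to horizon
    counts[t] += 1
    if chosen_arms[i] == best_arm:
      hits[t] += 1
  # Fraction of simulations that chose the best arm at each time step.
  return [hits[t] / float(counts[t]) for t in range(horizon)]

For example, prob_best_arm(results, ind_max(means), 250) gives the curve described by Approach 1; the same pattern, applied to results[3] and results[4], gives Approaches 2 and 3.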
Bandit algorithms 3 - The Epsilon-Greedy Algorithm
-- Introducing the Epsilon-Greedy Algorithm
When a new visitor comes to the site, the algorithm flips a coin that comes up tails with probability epsilon. (If you’re not used to thinking in terms of probabilities, the phrase "with probability X" means that something happens 100 * X percent of the time. So saying that a coin comes up tails with probability 0.01 means that it comes up tails 1% of the time.)
If the coin comes up heads, the algorithm is going to exploit. To exploit, the algorithm looks up the historical conversion rates for both the green and red logos in whatever data source it uses to keep track of things. After determining which color had the highest success rate in the past, the algorithm decides to show the new visitor the color that’s been most successful historically.
If, instead of coming up heads, the coin comes up tails, the algorithm is going to explore. Since exploration involves randomly experimenting with the two colors being considered, the algorithm needs to flip a second coin to choose between them. Unlike the first coin, we’ll assume that this second coin comes up heads 50% of the time. Once the second coin is flipped, the algorithm can move on with the last step of the procedure:
If the second coin comes up heads, show the new visitor the green logo.
If the second coin comes up tails, show the new visitor the red logo.
With probability 1 – epsilon, the epsilon-Greedy algorithm exploits the best known option.
With probability epsilon / 2, the epsilon-Greedy algorithm explores the best known option.
With probability epsilon / 2, the epsilon-Greedy algorithm explores the worst known option.
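A quick simulation of the two-coin procedure (our own illustration, not part of the book's code) confirms this breakdown for, say, epsilon = 0.1.

import random

epsilon = 0.1
counts = {"exploit": 0, "explore_green": 0, "explore_red": 0}
trials = 100000

for _ in range(trials):
  if random.random() > epsilon:    # first coin: exploit
    counts["exploit"] += 1
  elif random.random() < 0.5:      # second coin heads: explore green
    counts["explore_green"] += 1
  else:                            # second coin tails: explore red
    counts["explore_red"] += 1

# Expect roughly 0.90, 0.05, 0.05.
for key in counts:
  print(key, counts[key] / float(trials))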
-- Describing Our Logo-Choosing Problem Abstractly
What’s an Arm?
Arm makes more sense given the original motivations behind the design of the algorithms we’re describing in this book: these algorithms were originally invented to explain how an idealized gambler would try to make as much money as possible in a hypothetical casino. In this hypothetical casino, there’s only one type of game: a slot machine, which is also sometimes called a one-armed bandit because of its propensity to take your money. While this casino only features slot machines, it could still be an interesting place to visit because there are many different slot machines, each of which has a different payout schedule.
What's a Reward?
A reward is simply a measure of success: it might tell us whether a customer clicked on an ad or signed up as a user.
What's a Bandit Problem?
We’re facing a complicated slot machine, called a bandit, that has a set of N arms that we can pull on.
When pulled, any given arm will output a reward. But these rewards aren’t reliable, which is why we’re gambling: Arm 1 might give us 1 unit of reward only 1% of the time, while Arm 2 might give us 1 unit of reward only 3% of the time. Any specific pull of any specific arm is risky.
Not only is each pull of an arm risky, we also don’t start off knowing what the reward rates are for any of the arms. We have to figure this out experimentally by actually pulling on the unknown arms.
We only find out about the reward that was given out by the arm we actually pulled. Whichever arm we pull, we miss out on information about the other arms that we didn’t pull. Just like in real life, you only learn about the path you took and not the paths you could have taken.
Every time we experiment with an arm that isn’t the best arm, we lose reward because we could, at least in principle, have pulled on a better arm.
-- Implementing the epsilon-Greedy Algorithm
import random

def ind_max(x):
  # Index of the largest element of x.
  m = max(x)
  return x.index(m)

class EpsilonGreedy():
  def __init__(self, epsilon, counts, values):
    self.epsilon = epsilon  # probability of exploring on any given trial
    self.counts = counts
    self.values = values
    return

  def initialize(self, n_arms):
    self.counts = [0 for col in range(n_arms)]
    self.values = [0.0 for col in range(n_arms)]
    return

  def select_arm(self):
    if random.random() > self.epsilon:
      return ind_max(self.values)  # exploit the best arm seen so far
    else:
      return random.randrange(len(self.values))  # explore at random

  def update(self, chosen_arm, reward):
    self.counts[chosen_arm] = self.counts[chosen_arm] + 1
    n = self.counts[chosen_arm]
    value = self.values[chosen_arm]
    new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
    self.values[chosen_arm] = new_value
    return
def average(values):
  result = 0.0
  for value in values:
    result = result + value
  return result / len(values)

Or

def average(values):
  result = 0.0
  n = float(len(values))
  for value in values:
    result = result + value / n
  return result
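Both versions above need the full list of rewards, which a bandit algorithm never stores. The update() method used by the algorithms in these posts instead updates the running average incrementally, one reward at a time. A small sketch of that equivalence, using our own helper name:

def update_average(value, n, reward):
  # After seeing n rewards, the new average keeps (n - 1)/n of the old
  # average and adds 1/n of the latest reward -- the same rule used in
  # update() above, written as a standalone function.
  return ((n - 1) / float(n)) * value + (1 / float(n)) * reward

# Processing rewards one at a time gives the same answer as average().
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
value = 0.0
for n, reward in enumerate(rewards):
  value = update_average(value, n + 1, reward)
print(value)  # 0.6, the same as average(rewards)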
-- Thinking Critically about the epsilon-Greedy Algorithm
algo = EpsilonGreedy(1.0, [], [])
algo.initialize(2)
algo.epsilon = 0.0
But there are weaknesses with this approach as well. The first weakness is that, as you get more certain which of your two logo designs is best, this tendency to explore the worse design a full 5% of the time will become more wasteful. In the jargon of bandit algorithms, you’ll be over-exploring. And there’s another problem with a fixed 10% exploration rule: at the start of your experimentation, you’ll choose options that you don’t know much about far more rarely than you’d like to because you only try new options 10% of the time.
Bandit algorithms 2 - Multiarmed Bandit Algorithms
What Are We Trying to Do?
Traffic
Conversions
Sales
CTRs
The Business Scientist: Web-Scale A/B Testing
Standard A/B testing consists of:
A short period of pure exploration, in which you assign equal numbers of users to Groups A and B.
A long period of pure exploitation, in which you send all of your users to the more successful version of your site and never come back to the option that seemed to be inferior.
Why might this be a bad strategy?
It jumps discretely from exploration into exploitation, when you might be able to smoothly transition between the two.
During the purely exploratory phase, it wastes resources exploring inferior options in order to gather as much data as possible. But you shouldn’t want to gather data about strikingly inferior options.
Bandit algorithms provide solutions to both of these problems:
1) they smoothly decrease the amount of exploring they do over time instead of requiring you to make a sudden jump.
2) they focus your resources during exploration on the better options instead of wasting time on the inferior options that are over-explored during typical A/B testing.
Bandit algorithms 1 - Exploration and Exploitation
A Glossary
Reward
Arm
Bandit
Play/Trial
Horizon
Exploitation
Exploration
Explore/Exploit Dilemma
Annealing
Temperature
Streaming Algorithms
Online Learning
Active Learning
Bernoulli
The Explore-Exploit Dilemma
(A) learn about new ideas (which we’ll always call exploring from now on), while you also need to
(B) take advantage of the best of your old ideas (which we’ll always call exploiting from now on).
Introduction to Bandits: Algorithms and Theory
The Multi-Armed Bandit problem is a statistical problem in which a gambler plays multiple slot machines, each with an unknown probability of paying out, and tries to maximize his winnings over time. Slot machines, by the way, are sometimes called "one-armed bandits" because, on average, you leave them with less money than you arrived with.
What is appealing about the Multi-Armed Bandit problem is that it is solvable. The basic "learn-and-proceed" algorithm works quite well: for a certain period of time, play all the machines, and then keep playing whichever machine has won the most. Variations on this approach are also optimal under a wide range of assumptions.
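As a rough sketch of this "learn-and-proceed" idea (our own illustration, reusing the BernoulliArm and ind_max helpers from the posts above; the pull counts are arbitrary):

import random

def learn_and_proceed(arms, explore_pulls, total_pulls):
  # Phase 1: play every machine the same number of times.
  values = [0.0 for arm in arms]
  for i in range(len(arms)):
    for _ in range(explore_pulls):
      values[i] += arms[i].draw()
  # Phase 2: commit to the machine that won the most during phase 1.
  best = ind_max(values)
  winnings = sum(values)
  for _ in range(total_pulls - explore_pulls * len(arms)):
    winnings += arms[best].draw()
  return winnings

For example, learn_and_proceed([BernoulliArm(0.1), BernoulliArm(0.9)], 100, 10000) explores each machine 100 times and then commits to the apparent winner for the remaining 9,800 pulls.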
The Multi-Armed Bandit solution would seem to provide a path to site optimization. Start with a few variations. Treat each variation as a separate “slot machine”. Each time the page is about to be loaded, peer into the historical data on how well each version has done. Then choose the right version accordingly. Keep track of how good the page is. And, one of the optimization algorithms for the Multi-Armed Bandit problem can not only determine which page to load but guarantee optimal outcomes over time. Truly impressive.
This makes the testing problem much harder. If we know a handful of categories in advance, we can run a separate multi-armed bandit test for each. For instance:
- US-based, web-based, weekday working hours.
- US-based, mobile, weekday working hours.
...and so on, through the 12 combinations suggested by these groups.
When the groups are not predetermined, it is a harder problem to solve. This is often approached using off-line analysis, if there is adequate test data.
Another issue is that some messages may be targeted at a specific population. For instance, some IP addresses are registered under .EDU, indicating educational institutions. How do you handle market testing for these groups?
If there is only one such group, then you can set up separate Multi-Armed Bandit-type tests for EDU and for non-EDU visitors. But this can get more complicated. For instance, some messages might be targeted at web versus mobile, and now you have overlap, because EDU users are in both groups.
Yet another problem is that determining the outcome may take time. If the outcome is as simple as clicking on an ad, then it is known within a few minutes. If the outcome is purchasing a particular item, well, not everyone puts something in a cart and immediately checks out; you might want a one-day lag. If the outcome is a longer-term measurement (did site engagement go up? did time to next visit shrink?), then even longer periods are needed. This can result in a very slow feedback loop for the Multi-Armed Bandit. After all, a slot machine returns its winnings (or losses) very quickly.
What is the real solution to such adaptive marketing online? The real solution requires a comprehensive solution. This solution has to understand "natural" divisions online (such as organization type) and the fact that a given message may have differential effects in different areas. This solution has to take into account that some messages are targeted to specific visitors or customers (if they can be known), so pure randomization may not be possible. The solution has to understand the time dimension of the business, hopefully incorporating faster measures in the short-term, even if these are imperfect, along with validation of the measures in the longer term.
The Multi-Armed Bandit is a powerful idea. Unfortunately, the solution to this problem is only one part of an online marketing solution.
In the multi-armed bandit problem, at each stage an agent (or decision maker) chooses one action (or arm) and receives a reward from it. The agent aims at maximizing rewards. Since the agent does not know the process generating the rewards, it needs to explore (try) the different actions and yet exploit (concentrate its draws on) the seemingly most rewarding arms.
It is the simplest setting where one encounters the exploration-exploitation dilemma.