-- Introducing the Epsilon-Greedy Algorithm
When a new visitor comes to the site, the algorithm flips a coin that comes up tails with probability epsilon. (If you’re not used to thinking in terms of probabilities, the phrase "with probability X" means that something happens 100 * X percent of the time. So saying that a coin comes up tails with probability 0.01 means that it comes up tails 1% of the time.)
If the coin comes up heads, the algorithm is going to exploit. To exploit, the algorithm looks up the historical conversion rates for both the green and red logos in whatever data source it uses to keep track of things. After determining which color had the highest success rate in the past, the algorithm decides to show the new visitor the color that’s been most successful historically.
If, instead of coming up heads, the coin comes up tails, the algorithm is going to explore. Since exploration involves randomly experimenting with the two colors being considered, the algorithm needs to flip a second coin to choose between them. Unlike the first coin, we’ll assume that this second coin comes up heads 50% of the time. Once the second coin is flipped, the algorithm can move on to the last step of the procedure:
If the second coin comes up heads, show the new visitor the green logo.
If the second coin comes up tails, show the new visitor the red logo.
With probability 1 – epsilon, the epsilon-Greedy algorithm exploits the best known option.
With probability epsilon / 2, the epsilon-Greedy algorithm explores the best known option.
With probability epsilon / 2, the epsilon-Greedy algorithm explores the worst known option.
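To make the two-coin procedure concrete, here is a minimal sketch in Python. The function name choose_logo and the argument historical_best are placeholders of my own, not part of any standard presentation of the algorithm:

import random

def choose_logo(epsilon, historical_best):
    # First coin: heads (probability 1 - epsilon) means exploit the
    # historically best-performing color.
    if random.random() > epsilon:
        return historical_best
    # Tails: explore by flipping a fair second coin between the two colors.
    if random.random() > 0.5:
        return "green"
    else:
        return "red"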
-- Describing Our Logo-Choosing Problem Abstractly
What’s an Arm?
The term arm makes more sense given the original motivation behind the design of the algorithms we’re describing in this book: these algorithms were originally invented to explain how an idealized gambler would try to make as much money as possible in a hypothetical casino. In this hypothetical casino, there’s only one type of game: a slot machine, which is also sometimes called a one-armed bandit because of its propensity to take your money. While this casino only features slot machines, it could still be an interesting place to visit because there are many different slot machines, each of which has a different payout schedule.
What's a Reward?
A reward is simply a measure of success: it might tell us whether a customer clicked on an ad or signed up as a user.
What's a Bandit Problem?
We’re facing a complicated slot machine, called a bandit, that has a set of N arms that we can pull on.
When pulled, any given arm will output a reward. But these rewards aren’t reliable, which is why we’re gambling: Arm 1 might give us 1 unit of reward only 1% of the time, while Arm 2 might give us 1 unit of reward only 3% of the time. Any specific pull of any specific arm is risky.
Not only is each pull of an arm risky, we also don’t start off knowing what the reward rates are for any of the arms. We have to figure this out experimentally by actually pulling on the unknown arms.
We only find out about the reward that was given out by the arm we actually pulled. Whichever arm we pull, we miss out on information about the other arms that we didn’t pull. Just like in real life, you only learn about the path you took and not the paths you could have taken.
Every time we experiment with an arm that isn’t the best arm, we lose reward because we could, at least in principle, have pulled on a better arm.
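As a rough sketch of this setup (the 1% and 3% payout rates below just echo the example above), each arm can be simulated as a Bernoulli arm that pays out 1 unit of reward with a fixed but unknown probability:

import random

class BernoulliArm():
    def __init__(self, p):
        self.p = p  # probability that a single pull pays out 1 unit of reward

    def draw(self):
        # Pulling the arm: reward of 1.0 with probability p, otherwise 0.0.
        if random.random() > self.p:
            return 0.0
        else:
            return 1.0

# Two arms with the payout rates from the example above.
arms = [BernoulliArm(0.01), BernoulliArm(0.03)]
arms[0].draw()  # usually 0.0, occasionally 1.0

Crucially, each pull only reveals the reward it produced; the gambler never gets to see p directly.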
-- Implementing the epsilon-Greedy Algorithm
import random

def ind_max(x):
    # Return the index of the largest value in the list x.
    m = max(x)
    return x.index(m)

class EpsilonGreedy():
    def __init__(self, epsilon, counts, values):
        self.epsilon = epsilon  # probability of exploring on any given pull
        self.counts = counts    # number of times each arm has been pulled
        self.values = values    # estimated average reward for each arm
        return

    def initialize(self, n_arms):
        self.counts = [0 for col in range(n_arms)]
        self.values = [0.0 for col in range(n_arms)]
        return

    def select_arm(self):
        # Exploit with probability 1 - epsilon, otherwise explore at random.
        if random.random() > self.epsilon:
            return ind_max(self.values)
        else:
            return random.randrange(len(self.values))

    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] = self.counts[chosen_arm] + 1
        n = self.counts[chosen_arm]
        # Update the running average reward for the chosen arm.
        value = self.values[chosen_arm]
        new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value
        return
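The update method at the end computes a running average of the rewards incrementally. To see why, it helps to compare it with the ordinary ways of averaging a list of values, such as: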
def average(values):
    result = 0.0
    for value in values:
        result = result + value
    return result / len(values)

Or

def average(values):
    result = 0.0
    n = float(len(values))
    for value in values:
        result = result + value / n
    return result
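After an arm has been pulled n times, value in EpsilonGreedy.update() holds the mean of its first n - 1 rewards, so ((n - 1) / n) * value + (1 / n) * reward is exactly the mean of all n rewards, without ever storing the individual rewards. A quick sanity check with made-up reward values:

rewards = [0.0, 1.0, 0.0, 1.0, 1.0]

# Batch average, using either average() definition above.
print(average(rewards))  # 0.6

# Incremental average, as in EpsilonGreedy.update().
value = 0.0
for n, reward in enumerate(rewards, 1):
    value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
print(value)  # 0.6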
-- Thinking Critically about the epsilon-Greedy Algorithm
algo = EpsilonGreedy(1.0, [], [])
algo.initialize(2)
algo.epsilon = 0.0
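Setting epsilon = 1.0 makes the algorithm explore at random on every pull, while resetting it to 0.0 makes it always exploit its current best estimate; anything in between trades the two off. As a rough end-to-end sketch (reusing the hypothetical BernoulliArm simulator from earlier and an arbitrary epsilon of 0.1), the algorithm can be exercised like this:

import random

random.seed(1)  # arbitrary seed, just to make the run repeatable

arms = [BernoulliArm(0.01), BernoulliArm(0.03)]

algo = EpsilonGreedy(0.1, [], [])
algo.initialize(2)

for t in range(1000):
    chosen_arm = algo.select_arm()
    reward = arms[chosen_arm].draw()
    algo.update(chosen_arm, reward)

print(algo.counts)  # how many times each arm was pulled
print(algo.values)  # estimated reward rate for each arm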
But there are weaknesses with this approach as well. The first is that, as you become more certain which of your two logo designs is best, the tendency to explore the worse design a full 5% of the time becomes increasingly wasteful; in the jargon of bandit algorithms, you’ll be over-exploring. The second problem with a fixed 10% exploration rule (epsilon = 0.1) is the mirror image: at the start of your experimentation, when you know very little about either design, you’ll explore far less often than you’d like, because you only try options at random 10% of the time.