Monday, March 14, 2016

Bandit algorithms 6 - UCB - The Upper Confidence Bound Algorithm

-- Introducing the UCB Algorithm

Both the epsilon-Greedy algorithm and the Softmax algorithm share the following broad properties:

The algorithm's default choice is to select the arm that currently has the highest estimated value.

The algorithm sometimes decides to explore and chooses an option that isn't the one that currently seems best:

The epsilon-Greedy algorithm explores by selecting from all of the arms completely at random. It makes one of these random exploratory decisions with probability epsilon.

The Softmax algorithm explores by randomly selecting from all of the available arms with probabilities that are more-or-less proportional to the estimated value of each of the arms (both selection rules are sketched below).

To achieve better performance, both algorithms can be set up to explore less over time by annealing: modifying their basic parameters dynamically as plays accumulate.
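For concreteness, here is a minimal sketch of those two exploration rules and one possible annealing schedule. The helper names, the flat list of value estimates, and the specific schedule are assumptions for illustration, not code from this post:

import math
import random

def epsilon_greedy_select_arm(values, epsilon):
  # With probability epsilon explore uniformly at random,
  # otherwise exploit the arm with the highest estimated value.
  if random.random() > epsilon:
    return values.index(max(values))
  return random.randrange(len(values))

def softmax_select_arm(values, temperature):
  # Weight each arm in proportion to exp(value / temperature)
  # and sample an arm from the resulting distribution.
  z = sum(math.exp(v / temperature) for v in values)
  probs = [math.exp(v / temperature) / z for v in values]
  r = random.random()
  cumulative = 0.0
  for arm, prob in enumerate(probs):
    cumulative += prob
    if cumulative > r:
      return arm
  return len(probs) - 1

def annealed_parameter(total_plays):
  # One common annealing schedule: decay epsilon or the temperature
  # toward zero as plays accumulate. The small constant keeps the
  # logarithm defined on the very first play.
  return 1.0 / math.log(total_plays + 1.0000001)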

The epsilon-Greedy and Softmax algorithms are gullible: they are easily misled by a few negative early experiences, although their use of randomness lets them make up for this later. UCB takes a different approach. It keeps track of how many times each arm has been played and deliberately favors arms it knows little about, so its exploration is driven by uncertainty rather than by chance.

UCB doesn’t use randomness at all. Unlike epsilon-Greedy or Softmax, it’s possible to know exactly how UCB will behave in any given situation. This can make it easier to reason about at times.
UCB doesn’t have any free parameters that you need to configure before you can deploy it. This is a major improvement if you’re interested in running it in the wild, because it means that you can start to use UCB without having a clear sense of how you expect the world to behave.

-- Implementing UCB

import math

def ind_max(x):
  # Index of the largest element of x.
  m = max(x)
  return x.index(m)

class UCB1():
  def __init__(self, counts, values):
    self.counts = counts  # number of times each arm has been played
    self.values = values  # running average reward for each arm
    return

  def initialize(self, n_arms):
    self.counts = [0 for col in range(n_arms)]
    self.values = [0.0 for col in range(n_arms)]
    return

  def select_arm(self):
    n_arms = len(self.counts)
    # Play any arm that has never been tried, so every estimate is defined.
    for arm in range(n_arms):
      if self.counts[arm] == 0:
        return arm

    # Otherwise add a confidence bonus to each arm's estimated value:
    # the bonus grows with the total number of plays and shrinks as a
    # particular arm is played more often.
    ucb_values = [0.0 for arm in range(n_arms)]
    total_counts = sum(self.counts)
    for arm in range(n_arms):
      bonus = math.sqrt((2 * math.log(total_counts)) / float(self.counts[arm]))
      ucb_values[arm] = self.values[arm] + bonus
    return ind_max(ucb_values)

  def update(self, chosen_arm, reward):
    # Update the running average reward for the arm that was just played.
    self.counts[chosen_arm] = self.counts[chosen_arm] + 1
    n = self.counts[chosen_arm]

    value = self.values[chosen_arm]
    new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
    self.values[chosen_arm] = new_value
    return
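As a quick check that the class behaves sensibly, here is a minimal simulation sketch. The BernoulliArm class and the reward probabilities below are assumptions for illustration, not part of the original code:

import random

class BernoulliArm():
  # A simulated arm that pays reward 1.0 with probability p, else 0.0.
  def __init__(self, p):
    self.p = p

  def draw(self):
    return 1.0 if random.random() < self.p else 0.0

# Hypothetical testbed: five arms with different payout probabilities.
means = [0.1, 0.1, 0.1, 0.1, 0.9]
arms = [BernoulliArm(mu) for mu in means]

algo = UCB1([], [])
algo.initialize(len(arms))

for t in range(1000):
  chosen_arm = algo.select_arm()
  reward = arms[chosen_arm].draw()
  algo.update(chosen_arm, reward)

print(algo.counts)  # the best arm should accumulate most of the plays
print(algo.values)  # estimated rewards should approach the true means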

-- Comparing Bandit Algorithms Side-by-Side

We can very clearly see how much noisier UCB1’s behavior looks than the epsilon-Greedy or Softmax algorithm’s.
We see that the epsilon-Greedy algorithm doesn’t converge as quickly as the Softmax algorithm. This might suggest that we need to use another annealing schedule or that this testbed is one in which the Softmax algorithm is simply superior to the epsilon-Greedy algorithm.
We see that UCB1 takes a while to catch up with the annealing Softmax algorithm, but that it does start to catch up right near the end of the plays we’ve simulated. In the exercises we encourage you to try other environments in which UCB1 might outperform Softmax unambiguously.

UCB1 finds the best arm very quickly, but the backpedaling it does causes it to underperform the Softmax algorithm along most metrics.
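To reproduce this kind of side-by-side comparison, one approach is to run each algorithm many times on the same testbed and average the reward earned at each time step. This is only a sketch: it assumes every algorithm exposes the same initialize/select_arm/update interface and reuses the simulated Bernoulli arms from the example above:

def average_rewards(make_algo, arms, num_sims=200, horizon=250):
  # Monte Carlo estimate of the average reward earned at each time step.
  totals = [0.0] * horizon
  for sim in range(num_sims):
    algo = make_algo()
    algo.initialize(len(arms))
    for t in range(horizon):
      chosen_arm = algo.select_arm()
      reward = arms[chosen_arm].draw()
      algo.update(chosen_arm, reward)
      totals[t] += reward
  return [total / num_sims for total in totals]

# Any agent with the same interface can be dropped in alongside UCB1,
# and the resulting reward curves plotted against each other.
ucb_curve = average_rewards(lambda: UCB1([], []), arms)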
