Friday, March 11, 2016

Bandit algorithms 1 - Exploration and Exploitation

A Glossary

Reward: the payoff observed after a play (a click, a purchase, a slot-machine payout).
Arm: one of the options available to choose from (a slot machine, a page variation).
Bandit: a collection of arms; the problem of repeatedly choosing among them.
Play/Trial: a single selection of an arm and the observation of its reward.
Horizon: the total number of plays available.
Exploitation: choosing the arm that currently looks best.
Exploration: choosing other arms in order to learn more about them.
Explore/Exploit Dilemma: the tension between those two goals.
Annealing: gradually reducing the amount of exploration over time.
Temperature: the parameter controlling the randomness of exploration in annealed or softmax strategies.
Streaming Algorithms: algorithms that process data one item at a time, without storing it all.
Online Learning: learning continuously from data as it arrives, rather than from a fixed batch.
Active Learning: learning in which the algorithm chooses which data to gather next.
Bernoulli: a random outcome that is 1 with probability p and 0 otherwise.
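
Several of these terms (arm, play, reward, Bernoulli) come together in the standard simulation setup, sketched below in Python. The class name BernoulliArm is illustrative, not from any particular library:

```python
import random

class BernoulliArm:
    """An arm whose reward on each play is 1 with probability p, else 0."""
    def __init__(self, p):
        self.p = p  # the payout probability, unknown to the player

    def draw(self):
        # One play/trial: observe a Bernoulli reward.
        return 1.0 if random.random() < self.p else 0.0

# A three-armed bandit with hidden payout rates.
arms = [BernoulliArm(0.1), BernoulliArm(0.1), BernoulliArm(0.9)]
```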

The Explore-Exploit Dilemma

You need to (A) learn about new ideas (which we’ll always call exploring from now on), while you also need to

(B) take advantage of the best of your old ideas (which we’ll always call exploiting from now on).

Introduction to Bandits: Algorithms and Theory

The Multi-Armed Bandit problem is a statistical problem that models a gambler playing multiple slot machines. Each machine has an unknown probability of paying out, and the goal is to maximize the winnings over time. Slot machines, by the way, are sometimes called “one-armed bandits”, because, on average, you leave them with less money than you arrived with.

What is appealing about the Multi-Armed Bandit problem is that it is solvable. The basic “learn-and-proceed” algorithm works quite well: for a certain period of time, play all the machines, then stick with the machine that has paid out best. Variations on this approach are near-optimal under a wide range of assumptions.
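
As a concrete sketch of this learn-and-proceed idea (sometimes called “explore-then-commit”; the function name and parameters here are illustrative):

```python
import random

def learn_and_proceed(payout_probs, explore_plays, horizon):
    """Explore-then-commit: play every machine for a while, then
    stick with the one that has paid out best."""
    n_arms = len(payout_probs)
    counts = [0] * n_arms
    rewards = [0.0] * n_arms
    total = 0.0

    def draw(i):
        # Simulated slot machine: pays 1 with (hidden) probability payout_probs[i].
        return 1.0 if random.random() < payout_probs[i] else 0.0

    # Exploration phase: play all the machines equally.
    for _ in range(explore_plays):
        for i in range(n_arms):
            r = draw(i)
            counts[i] += 1
            rewards[i] += r
            total += r

    # Exploitation phase: commit to the best-looking machine.
    best = max(range(n_arms), key=lambda i: rewards[i] / counts[i])
    for _ in range(horizon - explore_plays * n_arms):
        total += draw(best)
    return total

print(learn_and_proceed([0.1, 0.2, 0.9], explore_plays=100, horizon=10000))
```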

The Multi-Armed Bandit solution would seem to provide a path to site optimization. Start with a few variations and treat each variation as a separate “slot machine”. Each time the page is about to be loaded, consult the historical data on how well each version has done, choose a version accordingly, and record the outcome. Some of the optimization algorithms for the Multi-Armed Bandit problem can not only determine which page to load but also guarantee near-optimal outcomes over time. Truly impressive.
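
The post doesn’t name a specific algorithm, but epsilon-greedy is one of the simplest that fits this serve-observe-update loop. A minimal sketch, with illustrative class and method names:

```python
import random

class EpsilonGreedy:
    """Serve the historically best page version most of the time,
    but try a random version with probability epsilon."""
    def __init__(self, n_versions, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_versions
        self.values = [0.0] * n_versions  # running mean reward per version

    def select_version(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)),
                   key=lambda i: self.values[i])       # exploit

    def update(self, version, reward):
        # Incremental mean: consult the history without storing every play.
        self.counts[version] += 1
        self.values[version] += (reward - self.values[version]) / self.counts[version]

# Each time the page is about to load:
#   v = bandit.select_version()  -> render version v
#   later: bandit.update(v, 1.0 if the visitor converted else 0.0)
```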

In practice, though, different kinds of visitors respond to different versions, and this makes the testing problem much harder. If we know a handful of categories in advance, we can run a separate multi-armed bandit problem for each. For instance:

- US-based, web-based, weekday working hours.
- US-based, mobile, weekday working hours.

...and so on, through the 12 combinations suggested by these groups (a sketch of this per-segment setup follows).
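
As a rough sketch of running “separate multi-armed bandit problems for each” predetermined segment, one can key an independent bandit on the segment tuple. Everything here (the epsilon-greedy learner, the segment fields, the function names) is illustrative, not from the original post:

```python
import random
from collections import defaultdict

class EpsilonGreedy:
    """Tiny epsilon-greedy learner over a fixed number of page versions."""
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

N_VERSIONS = 3

# One independent bandit per (location, device, daypart) combination.
bandits = defaultdict(lambda: EpsilonGreedy(N_VERSIONS))

def choose_version(location, device, daypart):
    return bandits[(location, device, daypart)].select()

def report_outcome(location, device, daypart, version, reward):
    bandits[(location, device, daypart)].update(version, reward)

# e.g. choose_version("US", "mobile", "weekday-working-hours")
```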

When the groups are not predetermined, the problem is harder to solve. It is often approached using offline analysis, if there is adequate test data.

Another issue is that some messages may be targeted at a specific population. For instance, some IP addresses resolve to .edu domains, indicating educational institutions. How do you handle market testing for these groups?

If there is only one such targeting dimension, then you can set up separate Multi-Armed Bandit-style tests for EDU and for non-EDU. But this can get more complicated. For instance, some messages might be targeted at web versus mobile users, and now you have overlap, because EDU users appear in both groups.

Yet another problem is that determining the outcome may take time. If the outcome is as simple as clicking on an ad, it is known within a few minutes. If the outcome is purchasing a particular item, it takes longer: not everyone puts something in a cart and immediately checks out, so you might want a one-day lag. If the outcome is a longer-term measurement (did site engagement go up? did time to next visit shrink?), then even longer periods are needed. This can result in a very slow feedback loop for the Multi-Armed Bandit. After all, a slot machine returns its winnings (or losses) very quickly.
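
One common way to live with a slow feedback loop is to decouple choosing from updating: remember which arm each play used, and apply the reward whenever it finally arrives. A minimal sketch, with assumed names (play_id, settle):

```python
import random

class DelayedBandit:
    """Epsilon-greedy whose rewards arrive late (e.g., a purchase a day
    after the page view, or an engagement metric weeks later)."""
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.pending = {}  # play_id -> arm chosen, awaiting an outcome

    def play(self, play_id):
        if random.random() < self.epsilon:
            arm = random.randrange(len(self.counts))
        else:
            arm = max(range(len(self.counts)), key=lambda i: self.values[i])
        self.pending[play_id] = arm  # remember the choice until the outcome is known
        return arm

    def settle(self, play_id, reward):
        # Called minutes, hours, or days later, when the outcome finally lands.
        arm = self.pending.pop(play_id, None)
        if arm is None:
            return  # unknown or already-settled play
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```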

What is the real solution to adaptive marketing online?  It has to be comprehensive.  It has to understand "natural" divisions online (such as organization type) and the fact that a given message may have different effects in different areas.  It has to take into account that some messages are targeted to specific visitors or customers (when they can be identified), so pure randomization may not be possible.  And it has to understand the time dimension of the business, ideally incorporating faster measures in the short term, even if they are imperfect, along with validation of those measures over the longer term.

The Multi-Armed Bandit is a powerful idea.  Unfortunately, the solution to this problem is only one part of an online marketing solution.

In the multi-armed bandit problem, at each stage, an agent (or decision maker) chooses one action (or arm) and receives a reward from it. The agent aims at maximizing its rewards. Since it does not know the process generating the rewards, it needs to explore (try) the different actions and yet exploit (concentrate its draws on) the seemingly most rewarding arms.


It is the simplest setting where one encounters the exploration-exploitation dilemma.
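
One standard way to make the dilemma precise (the notation below is conventional, not from the lecture): if each arm has an unknown mean reward and the best of them is \mu^*, then maximizing reward over a horizon of T plays is equivalent to minimizing the expected regret:

```latex
% Expected regret after T plays: the shortfall relative to always playing
% the best arm, whose mean reward is \mu^*; a_t is the arm chosen at step t.
R_T = T\,\mu^{*} - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_{a_t} \right]
```

Explore too little and you may never find the arm achieving \mu^*; explore too much and you keep paying the gap to the inferior arms. Good algorithms keep this regret growing only logarithmically in the horizon T.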
