This is the first in a series I’ll be doing on (deep) reinforcement learning where I’ll write about the topic and the interesting parts in a lightweight, easy-to-read format! A lot of this will be based off Sutton & Barto’s Reinforcement Learning book, and this particular post will be focusing on Chapter 2 from that book. Send any comments or corrections to

The Bandit Problem

The first time I heard about the bandit problem, I had just entered Carnegie Mellon University’s School of Computer Science. I knew next to nothing about the broader field of computer science. After I emailed the dean, Andrew Moore, asking for a bit of advice on finding my life direction, he very kindly set aside a bit of time in his undoubtedly busy schedule to talk with me one-on-one. He spoke about the transition from high school to college, and how one’s vision should appropriately broaden. He spoke about finding your niche, where you fit in and who you fit in with. He spoke about taking what he called technological risks - when you don’t know if something is even possible, but, knowing that you’re surrounded by the best minds in the field, you have a good chance of making something that was previously impossible, possible.

On the topic of a life direction, he introduced to me the bandit problem, which goes as follows: say you have a slot machine in front of you which has two levers - in contrast to normal slot machines, which have one lever and are often called one-armed bandits on account of their one lever. Say the two different levers of this two-armed bandit in front of you both make the slot machine spin and output some reward, but they do so differently so that pulling one lever or the other result in different payouts. Of course, nothing is for certain, so maybe the first lever has a higher average payout than the second one, or maybe the second one has a higher chance to give you nothing but also a higher chance to make you rich beyond your wildest dreams.

Unfortuately, you don’t know the statistical distributions of the payouts for each lever. But you want to get rich quick, and you only have enough money for, say, 100 lever pulls, so what do you do? One easy strategy is to pick a lever, and keep pulling that one. Maybe you’ll get lucky and pick the “better” lever, or maybe you’ll pick the “worse” lever. If you wanted to be smarter about it, you would sacrifice some initial payout and give each lever a couple pulls, just to see which one seems better, and once you had a good enough guess about which lever was better, spend the rest of your time only pulling that one. Hence, you spend some time in the exploration phase figuring out which lever is the best, and you spend the rest of your time in the exploitation phase, pulling the same lever and getting as much money as you can.1

It is important to note that the tasks of exploration and exploitation are conflicting - your goal is to get as much payout, or reward, as you can, and you get as much money as you can by exploitation. However, you might not know which strategy is best without exploration - exploring might make you try out unknown strategies to make sure that you’re not missing a potential goldmine. You can’t do just one and not the other - only exploring won’t pay off as much, and only exploiting might miss the best lever to pull. Finding the trade-off between the two is one of the most important parts of reinforcement learning.2

What exactly is reinforcement learning? Reinforcement learning is how an agent learns, by itself and by trying out different actions, which actions to take in various situations in order to maximize a reward. In fact, a reinforcement learning system has four main parts, a policy, which defines what actions the agent should take in a given situation, a reward signal, which gives a numerical representation of how well the agent is doing at the task or its goal, a value function, which specifies favorable states (where the potential for reward is high) and unfavorable states, and, optionally, a model of the environment, which can range from very simple to very complex and is quite often intractable.


Note: in this section, notation is kept consistent with Sutton & Barto’s formulations in Chapter 2 of Reinforcement Learning, an Introduction.

A k-armed bandit problem is defined as a situation where, at each time step, the agent has a choice from k different actions where each action results in a reward chosen from some unchanging probability distribution for that action. The agent aims to maximize the total reward gained over some fixed number of time steps, say, 100 or 1000. The analogy is to a bandit slot machine because each action can be likened to pulling a particular one out of the k levers of the slot machine and receiving the reward chosen from the appropriate distribution.

Let’s write this more formally - just like in deep learning, it is easy to read a lot of high-level discussion about reinforcement learning without really understanding anything - it is fairly simple, and writing the base formulations helps make it simple.

If we call the value of an action the mean reward when that action is taken - recall that the reward is sampled from a distribution and is rarely just a constant - and the action selected on time step \(t\) as \(A_t\) and the reward of that particular action as \(R_t\), we can write the value of an action \(a\) as the expected reward if \(a\) is taken:

\[q_* (a) = E[R_t \vert A_t = a]\]

However, because we don’t always know the true value of every action, we denote our best estimate of the value of action \(a\) as \(Q_t(a)\).

There are a couple ways of estimating \(Q_t(a)\) - one of the most basic is using the sample-average method, which is simply summing up all the rewards received after performing action \(a\) and dividing by the number of times action \(a\) was taken prior to the current time step \(t\).

\[Q_t(a) = \frac{\sum_{i = 1}^{t - 1} R_i \cdot \textbf{1}_{A_i = a}}{\sum_{i = 1}^{t - 1} \textbf{1}_{A_i = a}}\]

Where the bold \(\textbf{1}\) is just a random indicator variable that equals 1 if action \(a\) was taken on time step \(i\) and 0 otherwise, which just serves to make sure that we’re only working with the rewards when we actually took action \(a\).

If we wish to do a greedy action selection (i.e. picking the immediate best action) we just take the max estimated reward over all our actions and pick that one and call it \(A_t\).3

\[A_t \leftarrow \text{argmax}_a Q_t (a)\]

We can begin, now, to formally mesh exploration and exploitation. We want to be exploiting most of the time, so let’s define a small probability \(\varepsilon\) that we explore and select a random action, and the rest of the time, we exploit (with probability \(1-\varepsilon\)) and select the action with the highest estimated reward. We call this type of exploration-exploitation balance \(\varepsilon\)-greedy methods.

Updating with previous estimate

Now that we’re keeping track of all our estimates for action values \(Q_n\) after we’ve selected a given action \(n - 1\) times, we can show that for any \(n\), we can calculate \(Q_{n+1}\) at that step given only the current estimate \(Q_n\) and the current reward \(R_n\), rather than with all the previous rewards:

\[Q_n \stackrel{.}{=} \frac{R_1 + R_2 + \ldots + R_n}{n}\]


\[Q_{n + 1} = \frac{1}{n} \sum_{i = 1}^n R_i\] \[= \frac{1}{n}(R_n + \sum_{i = 1}^{n - 1} R_i)\] \[= \frac{1}{n}(R_n + (n-1)\frac{1}{n-1}\sum_{i = 1}^{n - 1} R_i)\] \[= \frac{1}{n}(R_n + (n-1)Q_n)\] \[= \frac{1}{n}(R_n + nQ_n - Q_n)\] \[= Q_n + \frac{1}{n}(R_n - Q_n)\]

This means that to calculate our new estimate, we just need our current estimate and the current reward! It’s also worth noting that the last equation is of the form

\[\text{New estimate} = \text{Old estimate} + \text{Step size} (\text{Target} - \text{Old estimate})\]

which intuitively makes sense - we want to be updating our estimate based off what our previous estimate was and how much the reality differs from our previous estimate, weighted by some learning factor.

My implementation

I’m working on my own basic implementation of \(\varepsilon\)-greedy methods on a 10-armed testbed where the true reward \(q_*(a)\) for each action is sampled from a normal distribution with mean 0 and variance 1, and the reward per action is sampled from a normal distribution with mean \(q_*(a)\) and variance 1. Stay tuned for results and my own plots - but for the meantime, Sutton & Barto have a good discussion of their sample results.

1: Andrew Moore said that I was still in the exploration phase, where my goal was to figure out what I wanted to do with my life and what I liked doing - the exploitation phase came later, when I would work at it as hard as I could.

2: Things get a bit more complicated once we make the payoffs for each lever change over time - what you thought was the optimal arm to pull might not be, after a while. But we’ll get into that later.

3: I use the pseudocode arrow notation for assignment here while Sutton & Barto use the \(\stackrel{.}{=}\) notation to represent a definition